Friday, December 31, 2010

An analysis of 5000 PDF files

To be exact, 4949 PDF files were used for the analysis.

Methodology:
To collect information about the PDF/pdf files available on my system, the following command was run:

# find / -name "*.[pP][dD][fF]" -type f -exec ./test.sh {} \;

test.sh script:

#!/bin/sh
# Append a separator line, the file path and the pdfinfo output to index.dat.
echo " " >> index.dat
echo "$1" >> index.dat
pdfinfo "$1" >> index.dat

The resulting index.dat was parsed into a PostgreSQL table with one column for each field that pdfinfo returns for a PDF file: title, author, creator, producer, keywords, subject, bytes, pages, creation_date, mod_date, page_size, pdf_version, encrypted, optimized and tagged.
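The parsing step is not shown in the post. As a rough sketch only, assuming index.dat has exactly the layout test.sh writes (a line holding a single space, then the file path, then pdfinfo's "Key: value" lines), an awk filter could flatten it to one tab-separated row per file. The function name parse_index and the choice of just two fields (Pages and PDF version) are illustrative assumptions, not the author's actual loader:

```shell
# Sketch: flatten index.dat into "path<TAB>pages<TAB>pdf_version" rows.
# parse_index is a hypothetical helper; it reads index.dat on stdin.
parse_index() {
    awk '
        function emit() {
            if (path != "") printf "%s\t%s\t%s\n", path, pages, version
            path = ""; pages = ""; version = ""
        }
        $0 == " "   { emit(); expect_path = 1; next }      # separator line from test.sh
        expect_path { path = $0; expect_path = 0; next }   # the path follows the separator
        {
            colon = index($0, ":")
            if (colon == 0) next                  # not a "Key: value" line
            key = substr($0, 1, colon - 1)
            value = substr($0, colon + 1)
            sub(/^[ \t]+/, "", value)             # drop the padding after the colon
            if (key == "Pages") pages = value
            else if (key == "PDF version") version = value
        }
        END { emit() }
    '
}
```

Run as `parse_index < index.dat > index.tsv`; something like `cut -f3 index.tsv | sort | uniq -c` would then give a file count per PDF version.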

PDF Version:
More than 3500 files were using version 1.3 or 1.4, introduced a decade ago. The file counts for all major versions were as follows:

PDF version (year introduced)    File count
1.0 (1993)                       3
1.1 (1994)                       196
1.2 (1996)                       720
1.3 (1999)                       1622
1.4 (2001)                       1912
1.5 (2003)                       260
1.6 (2005)                       187
1.7 (2006)                       49

The PDF file format was published as an open standard (ISO 32000-1) in 2008, and Version 2.0 is in the offing. It may safely be predicted that Version 2.0, or one of the subsequent 2.* stable releases, will have stabilized by 2020.
For those wondering about the three old files on my system using version 1.0, one of them is a copy of the "Hacker's Diet", by John Walker, that can still be downloaded from: http://fourmilab.ch/hackdiet/hdpdf.zip
The pdfinfo output for this file is as follows:

Author: John Walker
Creator: LaTeX
CreationDate: Monday, December 20, 1993 at 22:57
Tagged: no
Pages: 338
Encrypted: no
Page size: 342 x 476 pts
File size: 2007166 bytes
Optimized: no
PDF version: 1.0

Pages:
The 4949 PDF files have a total page count of 280,777. More than half of the PDF files (2638) have fewer than 10 pages. File counts for those:

Number of pages    File count
1                  893
2                  473
3                  332
4                  227
5                  192
6                  167
7                  146
8                  130
9                  78

About a third of the files (1579) fall in the 10-100 page range, where the average page count was 33. The page range statistics:

Page range    File count
1-10          2722
10-100        1579
100-1000      632
1000+         16
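The ranges above share their end points (10 appears in both "1-10" and "10-100"), so it is not clear from the table which bucket a 10-page file was counted in. A minimal sketch of the bucketing, assuming each range includes its upper bound and reading one page count per line on stdin (page_ranges is a hypothetical helper, not the query the author ran):

```shell
# Sketch: count files per page range, one page count per input line.
# Assumption: each range includes its upper bound (1-10, 11-100, ...).
page_ranges() {
    awk '
        $1 !~ /^[0-9]+$/ { next }          # skip blank or non-numeric lines
        $1 <= 10   { a++; next }
        $1 <= 100  { b++; next }
        $1 <= 1000 { c++; next }
                   { d++ }
        END {
            printf "1-10\t%d\n11-100\t%d\n101-1000\t%d\n1001+\t%d\n", a, b, c, d
        }
    '
}
```

Fed from a tab-separated file with the page count in the second column, this could be run as `cut -f2 index.tsv | page_ranges` (index.tsv is an assumed file name).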

Page size:
A4 is the most common page size, closely followed by letter size. Because the same paper size is reported with slightly different point values, the counts are listed as found and no summary is given here.
 
Page size                    File count
595 x 842 pts (A4)           1348
612 x 792 pts (letter)       1315
595.276 x 841.89 pts (A4)    510
468 x 648 pts                92
595.28 x 841.89 pts (A4)     89
595.22 x 842 pts (A4)        44
612 x 1008 pts               31
595 x 964 pts                30
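One way a summary could be produced despite the differing point values is to round each dimension to the nearest whole point before grouping. The sketch below does that for a single width and height; classify_page_size is a hypothetical helper, and treating 595 x 842 as A4 and 612 x 792 as letter follows the labels in the table above. With the figures listed, the four A4 variants would fold into one group of 1991 files (1348 + 510 + 89 + 44).

```shell
# Sketch: name a page size after rounding both dimensions to whole points.
# Usage: classify_page_size WIDTH HEIGHT   (values in points, as pdfinfo prints them)
classify_page_size() {
    awk -v w="$1" -v h="$2" 'BEGIN {
        w = int(w + 0.5); h = int(h + 0.5)      # round to the nearest point
        if (w == 595 && h == 842)      print "A4"
        else if (w == 612 && h == 792) print "letter"
        else                           print w " x " h
    }'
}
```

For example, `classify_page_size 595.276 841.89` and `classify_page_size 595.22 842` both report A4, while sizes without a known name are echoed back in rounded form.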

Author:
The Author field was blank in half of the PDF files. The good news is that, in the remaining half, author names are given fairly consistently across files. A few example names found:

Author (File count):
Benjamin Franklin (2)
D.S.KOTHARI (1)
David Carlisle, carlisle@cs.man.ac.uk (5)
Doyle, Arthur Conan (2)
Intel Corporation (2)
Lawrence Lessig (1)
The Unicode Consortium (5)

Title:
Some good title descriptions found:

Title (File count):
ACHARYA JAGADIS CHUNDER BOSE - HIS LIFE AND WORK (1858-1937) (1)
An Inquiry into the Nature and Causes of the Wealth of Nations (1)
APOLLO 8 MISSION REPORT FEB 1969 (1)
ARBITRATION LAW IN INDIA (1)
Free Culture (1)
Intel® Desktop Board D945GCLF Product Guide (1)
The Articles of Confederation (1)

Many files had titles consisting of a number with a "Chapter", "Form", "Annexure", "Volume" or similar prefix, which does not help to identify the file. A title like "Animal Farm by George Orwell" duplicates the Author field. As an example of good style, I could cite the five PDF documents on the system authored by David Carlisle, carlisle@cs.man.ac.uk. The Author name is written consistently across the PDF files, and the titles accurately describe the content:

The afterpage package
The delarray package
The hhline package
The indentfirst package
The xr package

Subject:
In most files (4564 to be precise), this field was left blank. PDF files from http://www.feedbooks.com/ do give the subject, which makes them easy to index:

Subject (File count):
Non-Fiction (2)
Non-Fiction, Essay (1)
Non-Fiction, Essay, Collections (2)
Non-Fiction, Essay, Politics (2)
Non-Fiction, History, Essay, Politics (1)
Non-Fiction, Philosophy (4)
Novels, Adventure (6)
Novels, History, Biography (1)
Novels, History, Romance, Adventure (1)
Novels, History, War (1)
Novels, Horror, Fantasy (1)
Novels, Philosophy, Science Fiction, Politics (1)
Short Fiction (1)
Short Fiction, Crime/Mystery, Collections (1)
Short Fiction, Horror (1)
Short Fiction, Science Fiction (8)
Short Fiction, Science Fiction, Collections (1)
Short Fiction, Young Readers, Fantasy, Collections (3)

Keywords:
4850 files did not give any keywords. In the few that do, the keywords help a lot in classifying the files. Some example keywords found:

[Org ] Keywords (File count):
[nVIDIA] GeForce 6, Video Processing Technology (1)
[ilugc ] FOSS, Free, Open, BSD, Linux, Software (1)
[pakin ] attachments; annotations; PDF; LaTeX; package; automatic; files (1)
[us-con] constitution independence hall philadelphia (1)
[bnhs ] wader, india, point calimere, india, bird, waterfowl, russia, monitoring, wetlands, india, cranes, important areas, waterbirds around the world (1)

Document Restrictions:
The vast majority of PDF files imposed no restrictions. The details:

Document restriction parameters                     File count
no                                                  4292
yes (print:no copy:no change:no addNotes:no)        570
yes (print:yes copy:no change:no addNotes:no)       53
yes (print:yes copy:no change:no addNotes:yes)      4
yes (print:yes copy:yes change:no addNotes:no)      20
yes (print:yes copy:yes change:no addNotes:yes)     2
yes (print:yes copy:yes change:yes addNotes:yes)    8
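To count files by individual permission (for instance, how many forbid copying) rather than by the whole string, the value can be split into one flag per line. This is only a sketch that assumes the value looks exactly like the strings in the table above; permissions is a hypothetical helper name:

```shell
# Sketch: split a pdfinfo "Encrypted" value into one "name value" pair per line.
# Usage: permissions 'yes (print:yes copy:no change:no addNotes:no)'
permissions() {
    printf '%s\n' "$1" | awk '
        $1 == "no" { print "encrypted no"; next }
        {
            print "encrypted yes"
            gsub(/[()]/, "")                 # drop the parentheses; fields are re-split
            for (i = 2; i <= NF; i++) {
                split($i, kv, ":")           # "copy:no" -> kv[1]="copy", kv[2]="no"
                print kv[1], kv[2]
            }
        }
    '
}
```

Piping the result through `grep '^copy no$'`, for example, would pick out the files that disallow copying.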

Information relating to copyright is mostly part of the content - it should ideally form part of the file metadata too. Many of the document restrictions cease to apply once the statutory period of copyright lapses, and it would help to have particulars about the owner of the copyright, the licensing terms and conditions, and full details of the source of publication.

Optimization:
PDF files are either linearized (optimized) or non-linearized (not optimized). Linearized files are optimized for the web: their pages can be viewed without waiting for the whole file to download, as is necessary with non-linearized files. The statistics for optimization were:

Optimization (File count):
false (2934)
true (2015)

Content:
Of course, pdfinfo doesn't help here - one has to read the file to judge content. 500+ files were from www.arvindguptatoys.com and 100+ were from www.gandhiserve.org - I recommend both sites for useful reading :)

References:
http://en.wikipedia.org/wiki/Portable_Document_Format
http://www.adobe.com

Wednesday, July 28, 2010

A black billed magpie's song

[Sighted at Chennai in April, 2010]

video

Sunday, February 21, 2010

CSX v. SEBI



Universal Law Publishing Co has published the Third Edition of M.V. Pylee's "Constitutional Amendments in India", in 2010, with full text of all Acts, Statement of Objects and Reasons and legislative history. The Constitution of India has so far witnessed Ninety-Four Amendments in its 60 years of operation.

Worldwide, the most significant constitutional reform we hear about concerns more open societies, where citizens actively participate in governance. George Soros spoke at length about the need for open societies, and a transcript of his lecture is at http://www.soros.org/resources/multimedia/sorosceu_20091112/capitalism_transcript The core problem is summarised by Soros in these words:


According to the modern concept of sovereignty, the natural resources of a country belong to the people of that country, but governments, which are supposed to be agents of the people, put their own interests ahead of the interests of the people whom they are supposed to represent and engage in all sorts of corrupt practices. On the opposite side, the managements of the international oil and mining companies represent the interests of the companies all too well. They used to go so far as to bribe governments in order to obtain concessions. Willing takers and givers of bribes are the root cause of the resource curse.

Once I became aware of the agency problem, I discovered it everywhere.

Communism failed because of the agency problem. Karl Marx's proposition - from everybody according to their ability and to everybody according to their needs - was a very attractive idea, but the communist rulers put their own interests ahead of the interests of the people.

The agency problem is also the bane of representative democracy: the elected representatives use their powers for their own interests to the detriment of the common interest.


The Internet, mobile phones and other electronic networks widely available today could make direct democracy possible. However, direct participation in policy and law making, without intermediaries and representatives, needs to be tried and tested before it can be realised through constitutional changes.

A ready testing ground for the direct participation of people is available at most stock exchanges. It is no accident that stock markets and legislative assemblies have advanced in parallel, replacing older forms of trade and governance.

The Bombay Stock Exchange originated as "The Native Share & Stock Brokers' Association" in 1875, making it the oldest stock exchange in Asia, continuously existing for more than 135 years. It was established soon after the East India Company was dissolved and liquidated in 1874 by the passing of the East India Stock Dividend Redemption Act, the Government of India Act, 1858 having already ended Company rule.

The Coimbatore Stock Exchange came up very recently. CSX developed around floor-based trading, but once electronic terminals became commonplace after 2000, the floors collapsed. Terminal-based trading changed the future for CSX. [Ref: http://www.sebi.gov.in/courtorders/madras.html etc.] The next big development in terminal-based trading might be when the brokers feel the heat and terminate.

Grass-roots democracy is in the air, though still nascent. The success of terminal-based trading should encourage terminal-based voting and law making. Constitutional history could be in the making again.

Saturday, February 06, 2010

Pups: Not found wanting ...







The litter of puppies made their home under a staircase that's dark most of the time. My visit to their nook woke them up. The seven snuggled together, keeping themselves warm and comfortable. Sensing no harm from me, they quickly dozed off.

Saturday, January 16, 2010

Annular solar eclipse over Chennai on 2010-01-15







A javascript eclipse calculator is available at http://www.chris.obyrne.com/Eclipses/calculator.html

The metrics for the partial annular eclipse for Chennai (13° 04' N, 80° 17' E. Time Zone 5:45 E) were:
Start of eclipse: 11:40:26.8
Mid eclipse: 13:45:36.1
End of partial eclipse: 15:30:10.8
Magnitude at mid eclipse: 0.89257
Ratio of size of moon/sun: 0.917