Friday, December 31, 2010

An analysis of 5000 PDF files

To be exact, 4949 PDF files were used for the analysis.

Methodology:
To collect information about every PDF file on my system (matching the .pdf extension in any letter case), the following command was run:

# find / -name "*.[pP][dD][fF]" -type f -exec ./test.sh {} \;

test.sh script:

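#!/bin/sh
# Append a separator line, the file path and the pdfinfo output to index.dat.
# "$1" is quoted so that paths containing spaces are handled correctly.
echo " " >> index.dat
echo "$1" >> index.dat
pdfinfo "$1" >> index.dat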

The resulting index.dat was parsed into a PostgreSQL table with one column for each value that pdfinfo returns for a PDF file: title, author, creator, producer, keywords, subject, bytes, pages, creation_date, mod_date, page_size, pdf_version, encrypted, optimized and tagged.
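
The parsing step itself is not shown here. As a minimal sketch of one way it could be done, assuming the record layout that test.sh writes (a line holding a single space, then the file path, then pdfinfo's "Key: value" lines), the following turns index.dat into tab-separated rows. The function name index_to_tsv and the choice of only three columns (path, pages, pdf_version) are illustrative assumptions, not the script actually used:

```shell
# Hypothetical helper: convert index.dat records into tab-separated rows
# of the form: path <TAB> pages <TAB> pdf_version.
# Assumed input layout (as written by test.sh): a line holding one space,
# then the file path, then the "Key: value" lines printed by pdfinfo.
index_to_tsv() {
    awk '
        function emit() {
            if (path != "") printf "%s\t%s\t%s\n", path, pages, version
            path = ""; pages = ""; version = ""
        }
        /^ $/           { emit(); want_path = 1; next }   # record separator
        want_path       { path = $0; want_path = 0; next } # line after it
        /^Pages:/       { pages = $2 }
        /^PDF version:/ { version = $3 }
        END             { emit() }                         # last record
    ' "$@"
}

# Usage: index_to_tsv index.dat > index.tsv
```

The remaining columns could be picked up the same way, one pattern per pdfinfo key. Files for which pdfinfo printed nothing still produce a row, with empty pages and version fields.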

PDF Version:
More than 3500 files were using version 1.3 or 1.4, introduced a decade ago. The file counts for all major versions were as follows:

PDF version (year introduced)   File count
1.0 (1993)                      3
1.1 (1994)                      196
1.2 (1996)                      720
1.3 (1999)                      1622
1.4 (2001)                      1912
1.5 (2003)                      260
1.6 (2005)                      187
1.7 (2006)                      49
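
The counts above came from the database table, but a similar tally can be taken straight from index.dat. A minimal sketch, assuming pdfinfo's "PDF version:" line appears once per file; the function name count_versions is hypothetical:

```shell
# Hypothetical helper: count files per PDF version from collected
# pdfinfo output. Prints one "version count" pair per line.
count_versions() {
    awk '/^PDF version:/ { n[$3]++ }
         END { for (v in n) print v, n[v] }' "$@" | sort
}

# Usage: count_versions index.dat
```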

The PDF file format was published as an open standard in 2008, and Version 2.0 is in the offing. It seems a safe prediction that Version 2.0, or one of the subsequent 2.* releases, will have stabilized by 2020.
For those wondering about the three old files on my system using version 1.0, one of them is a copy of the "Hacker's Diet", by John Walker, that can still be downloaded from: http://fourmilab.ch/hackdiet/hdpdf.zip The pdfinfo output for this file is as follows:

Author: John Walker
Creator: LaTeX
CreationDate: Monday, December 20, 1993 at 22:57
Tagged: no
Pages: 338
Encrypted: no
Page size: 342 x 476 pts
File size: 2007166 bytes
Optimized: no
PDF version: 1.0

Pages:
The 4949 PDF files have a total page count of 280,777. More than half of the PDF files have fewer than 10 pages. File counts for those with fewer than 10 pages:

Number of pages   File count
1                 893
2                 473
3                 332
4                 227
5                 192
6                 167
7                 146
8                 130
9                 78

About one third of the files have between 10 and 100 pages. The average page count was 33 for files with a page count between 10 and 100. The page range statistics:

Page range   File count
1-10         2722
10-100       1579
100-1000     632
1000+        16
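
The range labels above share their end points, so the exact bucket boundaries are not stated. A minimal sketch of how such a breakdown could be computed from index.dat, under the assumption that each range includes its upper bound (1-10, 11-100, 101-1000, 1001+); the function name page_buckets is hypothetical:

```shell
# Hypothetical helper: bucket files by page count and report the mean
# page count of the second bucket.
# Assumed boundaries: 1-10, 11-100, 101-1000, 1001+ (upper bound inclusive).
page_buckets() {
    awk '
        /^Pages:/ {
            p = $2 + 0
            if      (p <= 10)   a++
            else if (p <= 100)  { b++; sum_b += p }
            else if (p <= 1000) c++
            else                d++
        }
        END {
            printf "1-10 %d\n11-100 %d\n101-1000 %d\n1001+ %d\n", a, b, c, d
            if (b > 0) printf "mean pages in 11-100: %.1f\n", sum_b / b
        }
    ' "$@"
}

# Usage: page_buckets index.dat
```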

Page size:
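A4 is the most common page size, followed by letter. The same nominal size is reported with slightly different point values (for example, 595 x 842 and 595.276 x 841.89 for A4), so the counts are listed as reported and no merged summary is given here.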
 
Page size                     File count
595 x 842 pts (A4)            1348
612 x 792 pts (letter)        1315
595.276 x 841.89 pts (A4)     510
468 x 648 pts                 92
595.28 x 841.89 pts (A4)      89
595.22 x 842 pts (A4)         44
612 x 1008 pts                31
595 x 964 pts                 30
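
One way to merge the near-identical sizes would be to round each dimension to the nearest whole point before counting. This is a sketch of that idea rather than something done for this analysis; it assumes pdfinfo's "Page size: W x H pts" line format, and the function name count_page_sizes is hypothetical:

```shell
# Hypothetical helper: count page sizes after rounding width and height
# to the nearest point, so that variants such as 595.276 x 841.89 and
# 595 x 842 fall into the same row. Output: count <TAB> size, largest first.
count_page_sizes() {
    awk '
        /^Page size:/ {
            w = int($3 + 0.5); h = int($5 + 0.5)
            n[w " x " h]++
        }
        END { for (s in n) printf "%d\t%s pts\n", n[s], s }
    ' "$@" | sort -rn
}

# Usage: count_page_sizes index.dat
```

With this rounding, the four A4 variants listed above would collapse into a single 595 x 842 row (1348 + 510 + 89 + 44 = 1991 files).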

Author:
The Author field was blank in half of the PDF files. The good news is that, in the remaining half, author names are given consistently across the files. A few example names found:

Author (File count):
Benjamin Franklin (2)
D.S.KOTHARI (1)
David Carlisle, carlisle@cs.man.ac.uk (5)
Doyle, Arthur Conan (2)
Intel Corporation (2)
Lawrence Lessig (1)
The Unicode Consortium (5)

Title:
Some good title descriptions found:

Title (File count):
ACHARYA JAGADIS CHUNDER BOSE - HIS LIFE AND WORK (1858-1937) (1)
An Inquiry into the Nature and Causes of the Wealth of Nations (1)
APOLLO 8 MISSION REPORT FEB 1969 (1)
ARBITRATION LAW IN INDIA (1)
Free Culture (1)
Intel® Desktop Board D945GCLF Product Guide (1)
The Articles of Confederation (1)

Many titles were just numbers with "Chapter", "Form", "Annexure", "Volume" or other prefixes, which do not help to identify the file. A title like "Animal Farm by George Orwell" duplicates the Author field. As an example of good style, I could cite the five PDF files authored by David Carlisle, carlisle@cs.man.ac.uk, available on the system. The Author name is written consistently across the PDF files, and the titles accurately describe the content:

The afterpage package
The delarray package
The hhline package
The indentfirst package
The xr package

Subject:
In most files (4564 to be precise), this field was left blank. PDF files from http://www.feedbooks.com/ do give the subject, which makes them easy to index:

Subject (File count):
Non-Fiction (2)
Non-Fiction, Essay (1)
Non-Fiction, Essay, Collections (2)
Non-Fiction, Essay, Politics (2)
Non-Fiction, History, Essay, Politics (1)
Non-Fiction, Philosophy (4)
Novels, Adventure (6)
Novels, History, Biography (1)
Novels, History, Romance, Adventure (1)
Novels, History, War (1)
Novels, Horror, Fantasy (1)
Novels, Philosophy, Science Fiction, Politics (1)
Short Fiction (1)
Short Fiction, Crime/Mystery, Collections (1)
Short Fiction, Horror (1)
Short Fiction, Science Fiction (8)
Short Fiction, Science Fiction, Collections (1)
Short Fiction, Young Readers, Fantasy, Collections (3)

Keywords:
4850 files did not give any keywords. The few that do give them help a lot in classifying the files. Some example keywords found:

[Org ] Keywords (File count):
[nVIDIA] GeForce 6, Video Processing Technology (1)
[ilugc ] FOSS, Free, Open, BSD, Linux, Software (1)
[pakin ] attachments; annotations; PDF; LaTeX; package; automatic; files (1)
[us-con] constitution independence hall philadelphia (1)
[bnhs ] wader, india, point calimere, india, bird, waterfowl, russia, monitoring, wetlands, india, cranes, important areas, waterbirds around the world (1)

Document Restrictions:
The vast majority of PDF files imposed no restrictions. The details:

Document restriction parameters                     File count
no                                                  4292
yes (print:no copy:no change:no addNotes:no)        570
yes (print:yes copy:no change:no addNotes:no)       53
yes (print:yes copy:no change:no addNotes:yes)      4
yes (print:yes copy:yes change:no addNotes:no)      20
yes (print:yes copy:yes change:no addNotes:yes)     2
yes (print:yes copy:yes change:yes addNotes:yes)    8

Information relating to copyright is mostly part of the content; ideally it should form part of the file metadata too. Many of the document restrictions cease to apply once the statutory period of copyright has lapsed, so it would help to have particulars of the copyright owner and the licensing terms and conditions, along with full details of the source of publication.

Optimization:
PDF files are either linearized (optimized) or non-linearized (not optimized). Linearized files are optimized for the web: pages can be viewed before the whole file has downloaded, whereas a non-linearized file has to be downloaded completely first. The statistics for optimization were:

Optimization (File count):
false (2934)
true (2015)

Content:
Of course, pdfinfo doesn't help here - one has to read the file to judge content. 500+ files were from www.arvindguptatoys.com and 100+ were from www.gandhiserve.org - I recommend both sites for useful reading :)
