While using PDFtoText, a great little command-line program for extracting the text from a PDF file, I ran into a text extraction problem with a client's PDF.  In the past, I've successfully used PDFtoText to dump mountains of text from internal PDF reports, which makes the data much easier to munge with Perl or your scripting language of choice.
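
For anyone who wants to script that kind of dump, here is a minimal sketch in Python. It assumes the `pdftotext` binary (from Xpdf or Poppler) is on your PATH; the function names `build_command` and `extract_text` and the file names are made up for illustration, not part of any library.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, keep_layout=True):
    """Build the pdftotext command line as a list of arguments."""
    # -enc UTF-8 asks for UTF-8 output regardless of the build's default.
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page,
        # which helps when the report is mostly columns of numbers.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text (assumes it is installed)."""
    subprocess.run(build_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text("report.pdf", "report.txt")` would write the text file and hand back its contents, ready for whatever munging comes next.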

When I ran PDFtoText on my client's PDF, I kept getting undecipherable ASCII text in my output file.  After searching a few forums and web sites, I found that this problem can be caused by: (1) font subsetting, where only a portion of the font is embedded in the PDF, and/or (2) custom encoding for the generated font subsets, which depends on the sequence of the requested glyphs as they appear in the input stream.
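
To make the second cause concrete, here is a toy simulation in Python. It is not how any particular PDF producer works; the numbering scheme (codes handed out from 0x21 in order of first appearance) and the names `subset_encode`, `naive_extract`, and `mapped_extract` are assumptions made up for illustration. The point is only that the stored codes are meaningless unless the file also carries a map back to real characters.

```python
def subset_encode(text, first_code=0x21):
    """Give each distinct character a code in order of first appearance.

    Hypothetical scheme: a real producer's numbering may differ.
    """
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = first_code + len(code_for)
        codes.append(code_for[ch])
    # The reverse map plays the role of a code-to-character table.
    to_unicode = {code: ch for ch, code in code_for.items()}
    return bytes(codes), to_unicode


def naive_extract(data):
    """Treat the stored codes as if they were ordinary character codes."""
    return data.decode("latin-1")


def mapped_extract(data, to_unicode):
    """Recover the text using the code-to-character table."""
    return "".join(to_unicode[b] for b in data)


data, table = subset_encode("Hello PDF")
print(naive_extract(data))          # prints !"##$%&'(
print(mapped_extract(data, table))  # prints Hello PDF
```

Without the table, an extractor can only guess, and the guess comes out as gibberish like the first line of output.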

Since I couldn't find a reliable method of extracting the text, I ended up going back to the client to get the data in a different format.

The URLs listed below do a pretty good job of explaining why it can be difficult to extract text from a PDF document.

  • PDF to HTML paper: [PDF]