August 16, 2007
PDFtoTEXT: Font Encoding Issues
While working with PDFtoText, a great little command-line program for extracting the text from a PDF file, I ran into a text extraction problem with a client's PDF. In the past, I've successfully used PDFtoText to dump mountains of text from internal PDF reports, which makes the data much easier to munge with Perl or your scripting language of choice.
When I ran PDFtoText on my client's PDF, I kept getting undecipherable ASCII text in my output file. After searching a few forums and web sites, it appears that this problem can be caused by: (1) font subsetting, where only a portion of the embedded font is included with the PDF, and/or (2) a custom encoding for the generated font subsets, which depends on the order in which the requested glyphs appear in the input stream.
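To make cause (2) concrete, here is a toy sketch in Python of how a subset font with a custom encoding can defeat a naive extractor. It is only an illustration under simplifying assumptions, not how PDFtoText or any particular PDF producer actually works. The function names (build_subset_encoding, encode, naive_extract, mapped_extract) are made up for this example, and the sketch assumes fewer than 256 distinct characters so that each code fits in one byte.

```python
def build_subset_encoding(text):
    """Give each distinct character a code in order of first appearance.

    This mimics (very loosely) a font subset whose character codes
    depend on the sequence of glyphs requested by the document.
    """
    encoding = {}
    for ch in text:
        if ch not in encoding:
            encoding[ch] = len(encoding) + 1  # codes start at 1
    return encoding


def encode(text, encoding):
    """Store the text as the subset's character codes, one byte each."""
    return bytes(encoding[ch] for ch in text)


def naive_extract(data):
    """Assume the codes are standard Latin-1 character codes (they are not)."""
    return data.decode("latin-1")


def mapped_extract(data, encoding):
    """Recover the text using the code-to-character mapping."""
    reverse = {code: ch for ch, code in encoding.items()}
    return "".join(reverse[b] for b in data)


text = "Quarterly report"
enc = build_subset_encoding(text)
data = encode(text, enc)

print(repr(naive_extract(data)))        # unreadable control characters
print(repr(mapped_extract(data, enc)))  # the original text
```

In this toy model, "r" gets code 4 in "Quarterly report" but would get code 1 in a document that starts with "report", so the codes mean nothing without the mapping. A real PDF can carry an equivalent mapping (a ToUnicode table) for each font; when it is missing or wrong, an extractor has little to go on, which matches the gibberish I was seeing.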
Since I couldn't find a reliable method of extracting the text, I ended up going back to the client to get the data in a different format.
The URLs listed below do a pretty good job explaining why it can be difficult to extract text from a PDF document.
- PDF to HTML paper: [PDF]
August 08, 2007
IRS Updates their Procedures for Filing 1099 Electronically
The IRS published their new procedures for filing 1099s electronically for tax year 2007. This is normally documented in IRS Publication 1220, but you can find the same information in IRB 2007-30. [html] [pdf]
If you are new to filing 1099s electronically, Wikipedia has a decent article to get you up to speed.
*UPDATE* 29 AUG 2007
The IRS has published an updated version of IRS Publication 1220, which covers electronic filing of 1099s for tax year 2007. [IRS PDF Download site]