Ari B. Friedman

Reputation: 72731

How do I determine the encoding of text if I already know what a sample should be?

I'm trying to PDF scrape a list of physician names. The file appears to be in mixed encoding.

When I copy/paste a single physician's name (page 51), I get this:

Dandona, Suklesh 

If I paste just the gibberish part into a text file and try enca, I get:

enca -L none CHC_test.txt 
Universal transformation format 8 bits; UTF-8

Which ain't it.

The wrinkle here that makes this not a duplicate of previous questions is that if I just view the file in a PDF viewer I can see the address. It's (typing it out): 1601 Main St Suite 306

So how do I convert the addresses in this file? enca doesn't seem to accept known text strings. I guess I could run every single supported encoding through iconv programmatically and see if the result equals what I typed out above. Since R has an iconv interface I might do just that, but perhaps someone has a better solution?
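For what it's worth, that brute-force idea is straightforward to sketch. Here's a minimal Python version rather than R; the function name, the candidate list, and the sample bytes are all illustrative, and the candidate list would need to be extended to everything your iconv supports:

```python
def find_encoding(raw, expected, candidates=None):
    """Return the first candidate encoding that decodes the `raw` bytes
    to the `expected` string, or None if nothing matches."""
    if candidates is None:
        # A few common suspects; `iconv -l` lists many more.
        candidates = ["utf-8", "latin-1", "cp1252", "mac_roman", "cp437"]
    for enc in candidates:
        try:
            if raw.decode(enc) == expected:
                return enc
        except (UnicodeDecodeError, LookupError):
            continue
    return None

# The address as it appears in the PDF viewer, against the pasted bytes:
print(find_encoding(b"1601 Main St Suite 306", "1601 Main St Suite 306"))
```

Note that for pure-ASCII text like this address, many encodings will match, so a hit only tells you the encoding is *consistent* with the sample, not that it's the right one for the whole file.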

I'm aware of the usual caveats about encoding: there's no way to know for sure, unicode is not an encoding, etc. I have read Joel, I promise. :-D

Upvotes: 0

Views: 101

Answers (1)

themel

Reputation: 8895

This is not an encoding issue; you're dealing with an obfuscated PDF, which is likely a deliberate measure to keep people paying for databases of this information. This is one of the features of transporting our documents around the Interwebs as programs in a Turing-complete language.

Your best bet is to render this to an image and then parse using OCR, which works nicely in my tests (using ImageMagick to convert to 300dpi PNGs and parsing them using cuneiform on Linux):

themel@kallisti: ~/so $ grep Street cuneiform-out.txt 
Adoue Street 
7930 Broadway Street Suite 
6516 Broadway Street Suite 
6516 Broadway Street Suite 
218 East House Street 
303 North Mckinney Street 
826 South Meyer Street 
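A sketch of that pipeline, driving the same tools from Python via subprocess — it assumes ImageMagick's `convert` and `cuneiform` are on the PATH, and the file names are illustrative:

```python
import glob
import subprocess

def ocr_pdf(pdf_path, dpi=300):
    """Rasterize a PDF with ImageMagick, OCR each page with cuneiform,
    and return the concatenated text."""
    # One PNG per page; -density before the input sets the rasterization dpi.
    subprocess.run(
        ["convert", "-density", str(dpi), pdf_path, "page-%03d.png"],
        check=True,
    )
    chunks = []
    for png in sorted(glob.glob("page-*.png")):
        txt = png.replace(".png", ".txt")
        # cuneiform writes its recognized text to the file given by -o.
        subprocess.run(["cuneiform", "-o", txt, png], check=True)
        with open(txt, encoding="utf-8", errors="replace") as fh:
            chunks.append(fh.read())
    return "\n".join(chunks)
```

From there, grepping the returned text for "Street" (or "Suite") reproduces the kind of output shown above.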

Upvotes: 1
