Reputation: 72731
I'm trying to PDF scrape a list of physician names. The file appears to be in mixed encoding.
When I copy/paste a single physician's name (page 51), I get this:
Dandona, Suklesh
If I paste just the gibberish part into a text file and try enca, I get:
enca -L none CHC_test.txt
Universal transformation format 8 bits; UTF-8
Which ain't it.
The wrinkle here that makes this not a duplicate of previous questions is that if I just view the file in a PDF viewer I can see the address. It's (typing it out): 1601 Main St Suite 306
So how do I convert the addresses in this file? enca doesn't seem to take known text strings. I guess I could run every supported encoding through iconv programmatically and see whether the result equals what I typed out above. Since R has an iconv interface I might do just that, but perhaps someone has a better solution?
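The brute-force idea sketched above can be done in a few lines of Python instead of R — loop over every codec the standard library knows, decode the raw bytes, and keep the encodings whose output matches the text seen in the PDF viewer. This is only a sketch under an assumption: the `raw` bytes here are made-up demo data, since the actual gibberish bytes from the PDF are not reproduced in this post.

```python
# Brute-force search: try decoding raw bytes with every encoding Python
# knows about and return those whose output matches the expected text.
import encodings.aliases

def find_encodings(raw: bytes, expected: str):
    """Return the names of encodings under which `raw` decodes to `expected`."""
    hits = []
    for enc in sorted(set(encodings.aliases.aliases.values())):
        try:
            if raw.decode(enc) == expected:
                hits.append(enc)
        except (UnicodeDecodeError, LookupError, ValueError):
            # LookupError covers non-text codecs (base64_codec, zlib_codec, ...)
            pass
    return hits

# Demo with fabricated data: in practice `raw` would be the pasted
# gibberish bytes. Note that pure-ASCII text like this address will
# match many encodings (ascii, cp1252, utf_8, ...), so expect several hits.
expected = "1601 Main St Suite 306"
raw = expected.encode("cp1252")
print(find_encodings(raw, expected))
```

If no encoding matches, that is itself informative — it suggests the problem is not an encoding at all, which turns out to be the case here.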
I'm aware of the usual caveats about encoding: there's no way to know for sure, unicode is not an encoding, etc. I have read Joel, I promise. :-D
Upvotes: 0
Views: 101
Reputation: 8895
This is not an encoding issue; you're dealing with an obfuscated PDF, which is likely a deliberate measure to keep people paying for databases of this information. This is one of the perks of shipping our documents around the Interwebs as programs in a Turing-complete language.
Your best bet is to render this to an image and then parse using OCR, which works nicely in my tests (using ImageMagick to convert to 300dpi PNGs and parsing them using cuneiform on Linux):
themel@kallisti: ~/so $ grep Street cuneiform-out.txt
Adoue Street
7930 Broadway Street Suite
6516 Broadway Street Suite
6516 Broadway Street Suite
218 East House Street
303 North Mckinney Street
826 South Meyer Street
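The pipeline described above could look something like the following sketch. The filenames are placeholders, and it assumes ImageMagick and cuneiform are installed; cuneiform's flags may vary by version, so check its man page on your system.

```shell
# Rasterize the PDF at 300 dpi (one PNG per page), then OCR each page.
convert -density 300 input.pdf page-%03d.png
for f in page-*.png; do
    cuneiform -o "${f%.png}.txt" "$f"   # write OCR text next to each image
done
grep Street page-*.txt                  # pull out the address lines
```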
Upvotes: 1