Reputation: 495
I am currently scraping some text data from several PDFs using the readPDF() function in the tm package. This all works very well, and in most cases the encoding seems to be "latin1" - in some, however, it is not. Is there a good way in R to check character encodings? I found the functions is.utf8() and is.locale() in the tau package, but that obviously only gets me so far.
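For reference, roughly what I have been trying (txt stands here for a character vector of page texts extracted with readPDF()):

    library(tau)

    ## txt is a placeholder for the character vector of extracted page texts
    txt <- c("plain ASCII text", "caf\xe9")   # the second element carries a latin1 byte

    is.utf8(txt)    # TRUE where the bytes form valid UTF-8
    is.locale(txt)  # TRUE where the bytes are valid in the current locale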
Thanks.
Upvotes: 3
Views: 623
Reputation: 90213
I don't know much about R, but I have now poked around a bit on CRAN to see what the mentioned tm and tau packages are.
So tm is for text mining, and for PDF reading it requires and relies on the pdftotext utility from Poppler. At first I had the (obviously wrong) impression that your mentioned readPDF() function was doing some low-level, library-based access to PDF objects directly in the PDF file... How wrong I was! It turns out it 'only' looks at the text output of the pdftotext command-line tool.
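To illustrate (just a rough sketch, assuming pdftotext is on your PATH and "doc.pdf" is a placeholder file name), you can do from R essentially what readPDF() does, and explicitly ask pdftotext for UTF-8 output:

    pdf  <- "doc.pdf"                        # placeholder input file
    txtf <- tempfile(fileext = ".txt")

    ## pdftotext's -enc option selects the encoding of the text it writes out
    system2("pdftotext", args = c("-enc", "UTF-8", shQuote(pdf), shQuote(txtf)))

    txt <- readLines(txtf, encoding = "UTF-8")
    head(txt)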
Now this explains why you'll probably not succeed in reading any of the PDFs that use font encodings more complex than the 'simple' Latin1.
I'm afraid the reason for your problem is that Poppler and pdftotext are simply not yet able to handle these.
Maybe you're better off asking the tm maintainers for a feature request. :-)
Upvotes: 1
Reputation: 90213
The PDF specification defines these encodings for simple fonts (each of which can include a maximum of 256 character shapes) used for Latin text, and they should be predefined in any conforming reader:
/StandardEncoding
/MacRomanEncoding
/PDFDocEncoding
/WinAnsiEncoding
/MacExpertEncoding
Then there are two specific encodings for symbol fonts: those built into the Symbol and ZapfDingbats fonts.
Also, fonts can have built-in encodings, which may deviate in any way their creator wanted from a standard encoding (e.g. this is also used for differences encoding when embedded standard fonts are subsetted).
So in order to correctly interpret a PDF file, you'll have to look up each of the font encodings of the fonts used, and you must take into account any /Encoding using a /Differences array too.
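A rough sketch of how you could at least inspect those font encodings from R (assuming Poppler's pdffonts utility is installed and on the PATH, and "doc.pdf" again stands for your file):

    ## pdffonts prints one line per font; recent Poppler builds include an
    ## 'encoding' column (e.g. WinAnsi, MacRoman, Custom, Identity-H)
    out <- system2("pdffonts", args = shQuote("doc.pdf"), stdout = TRUE)
    cat(out, sep = "\n")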
However, the overall task is still quite simple for simple fonts. The PDF viewer program just needs to map 1:1 "each one of a sequence of bytes I see that's meant to represent a text string" to "exactly one glyph for me to draw which I can look up in the encoding table".
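As a loose analogy in R (not the PDF machinery itself): a simple-font encoding is just a 256-entry table, and iconv() does exactly that kind of 1:1 byte-to-character lookup for a single-byte encoding such as Latin1:

    bytes <- as.raw(c(0x48, 0xe9, 0x6c, 0x6c, 0x6f))        # "Héllo" encoded as latin1
    iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")  # one byte -> one character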
For composite, CID-keyed fonts (which may contain many thousands of character shapes), the lookup/mapping for the viewer program for "this is the sequence of bytes I see that I'm supposed to draw as text" to "this is the sequence of glyph shapes to draw" is no longer 1:1. Here, a sequence of one or more bytes needs to be decoded to select each glyph from the CIDFont.
And to help this CIDFont decoding, there need to be CMap structures around. CMaps define mappings from Unicode encodings to character collections. The PDF specification defines at least 5 dozen CMaps -- and their standard names -- for Chinese, Japanese and Korean language fonts. These pre-defined CMaps need not be embedded in the PDF (but the conforming PDF reader needs to know how to handle them correctly). But there are (of course) also custom CMaps which may have been generated 'on the fly' when the PDF-creating application wrote out the PDF. In that case the CMap needs to be embedded in the PDF file.
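Again only as a loose R analogy: with a two-byte encoding such as UTF-16BE, a pair of bytes selects each character, so the byte-to-character (or byte-to-glyph) mapping is no longer 1:1:

    bytes <- as.raw(c(0x30, 0x42, 0x30, 0x44))                 # two 2-byte code units
    iconv(rawToChar(bytes), from = "UTF-16BE", to = "UTF-8")   # -> two characters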
All details about these complexities are laid down in the official PDF-1.7 specification.
Upvotes: 1