Reputation: 3505
I have been using the XML package successfully for extracting HTML tables but want to extend to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered if there have been any recent developments.
Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?
Upvotes: 10
Views: 4253
Reputation: 5940
The heart of the tabula application that can extract tables from PDF documents is available as a simple command line Java application, tabula-extractor.
This Java app has been wrapped in R by the tabulizer package. Pass it the path to a PDF file and it will try to extract data tables for you and return them as data frames.
For an example, see When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
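A minimal sketch of that workflow, assuming tabulizer and its rJava dependency are installed ("report.pdf" is a placeholder path):

    library(tabulizer)

    # extract_tables() returns a list with one element per detected table
    tables <- extract_tables("report.pdf")

    # Elements are character matrices by default; coerce as needed
    head(tables[[1]])

From there you can convert each matrix to a data frame and clean up column types yourself.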
Upvotes: 1
Reputation: 13080
You might want to check out the text mining package tm. I recall that it implements so-called readers, and there is one for PDFs.
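Something along these lines should work, though it's only a sketch assuming the xpdf engine (i.e. the pdftotext utility on your PATH) and a placeholder file name:

    library(tm)

    # readPDF() returns a reader function; "-layout" asks pdftotext to
    # preserve the physical layout, which helps with tables
    pdf_reader <- readPDF(engine = "xpdf", control = list(text = "-layout"))

    doc <- pdf_reader(elem = list(uri = "report.pdf"),
                      language = "en", id = "report")

    # The document body is a character vector, one element per line of text
    head(content(doc))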
Upvotes: 5
Reputation: 121077
AFAIK there isn't an easy way of turning PDF tables into something useful for data analysis. You can use the Data Science Toolkit's File to Text utility (R interface via the RDSTK package), then parse the resulting text. Be warned: the parsing is often non-trivial.
EDIT: There's a useful discussion of converting PDFs to XML on discerning.com. The short answer is that you will probably need to buy a commercial tool.
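As a rough sketch, you could also hit the File to Text endpoint directly with httr; the /file2text path and a reachable DSTK server are assumptions here, so check the RDSTK documentation for the wrapped equivalent:

    library(httr)

    # POST the PDF to the (assumed) file2text endpoint;
    # "report.pdf" is a placeholder path
    resp <- POST("http://www.datasciencetoolkit.org/file2text",
                 body = list(file = upload_file("report.pdf")))

    raw_text <- content(resp, as = "text")

    # The non-trivial part: split into lines and work out
    # the table structure yourself
    lines <- strsplit(raw_text, "\n")[[1]]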
Upvotes: 4
Reputation: 94192
Extracting text from PDFs is hard, and nearly always requires lots of care.
I'd start with command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined-up 'ff' and 'ij' you see in proper typesetting) to throw you off.
pdftotext is installable on any Linux system...
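A minimal example of shelling out to it from R ("report.pdf" is a placeholder; -layout tries to preserve column positions, which helps with tables):

    # Run pdftotext and read the result back in
    system2("pdftotext", args = c("-layout", "report.pdf", "report.txt"))

    txt <- readLines("report.txt")

    # Eyeball the output before attempting to parse columns by position
    head(txt, 20)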
Upvotes: 11