pssguy

Reputation: 3505

PDF scraping using R

I have been using the XML package successfully to extract HTML tables, but I want to extend this to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered whether there have been any recent developments.

Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?

Upvotes: 10

Views: 4253

Answers (4)

psychemedia

Reputation: 5940

The heart of the tabula application that can extract tables from PDF documents is available as a simple command line Java application, tabula-extractor.

This Java app has been wrapped in R by the tabulizer package. Pass it the path to a PDF file and it will try to extract data tables for you and return them as data.

For an example, see When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.

Upvotes: 1

Rappster

Reputation: 13080

You might want to check out the text mining package tm. I recall that it implements so-called readers, and that there was one for PDFs.

Upvotes: 5

Richie Cotton

Reputation: 121077

AFAIK there isn't an easy way of turning PDF tables into something useful for data analysis. You can use the Data Science Toolkit's File to Text utility (R interface via the RDSTK package), then parse the resulting text. Be warned: the parsing is often non-trivial.
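
To give a feel for the "parse the resulting text" step, here is a minimal Python sketch. It assumes the extracted text keeps table columns aligned with runs of two or more spaces, which will not hold for every PDF. The function names and the sample rows are made up for illustration. The CSV it produces could then be read into R with read.csv.

```python
import csv
import io
import re


def split_layout_line(line):
    """Split one line of whitespace-aligned text into cells.

    Assumption: columns are separated by two or more whitespace
    characters, so single spaces inside a cell are preserved.
    """
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def layout_text_to_csv(text):
    """Convert whitespace-aligned text to CSV, skipping blank lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in text.splitlines():
        cells = split_layout_line(line)
        if cells:
            writer.writerow(cells)
    return buffer.getvalue()


# Made-up sample standing in for text extracted from a PDF table.
sample = (
    "Item        Quantity   Price\n"
    "\n"
    "Widget A    10         2.50\n"
    "Widget B    3          12.00\n"
)

csv_text = layout_text_to_csv(sample)
print(csv_text)
```

Real tables tend to break this in various ways (empty cells, wrapped cell text, headers spanning several columns), which is where the non-trivial part comes in.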


EDIT: There's a useful discussion of converting PDFs to XML on discerning.com. The short answer is that you will probably need to buy a commercial tool.

Upvotes: 4

Spacedman

Reputation: 94192

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system...
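
As a rough sketch of that first step from Python, the snippet below shells out to pdftotext and then expands ligature characters. It assumes pdftotext is on your PATH. The function names and the file name in the usage comment are hypothetical. Unicode NFKC normalization is just one way to deal with ligatures, and it also rewrites other compatibility characters (superscripts, non-breaking spaces and so on), so check that it suits your data.

```python
import subprocess
import unicodedata


def pdf_to_text(pdf_path):
    """Run pdftotext on a PDF and return the extracted text.

    -layout asks pdftotext to keep the physical layout of the page,
    -enc UTF-8 sets the output encoding, and "-" sends the output
    to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def expand_ligatures(text):
    """Replace ligature characters such as 'ﬀ' with plain letters.

    NFKC normalization maps compatibility characters to their plain
    equivalents, so this changes more than just ligatures.
    """
    return unicodedata.normalize("NFKC", text)


# Usage (hypothetical file name):
#   text = expand_ligatures(pdf_to_text("report.pdf"))
print(expand_ligatures("o\ufb00ice"))
```

This only gets you the text. The reading order and the font encoding problems described above still have to be checked by eye.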

Upvotes: 11
