Parsing PDF by paragraph in R using pdftools

Question

I am trying to parse a PDF document by paragraph in R. I have the PDF saved on my local machine, so to reproduce this, please download the sample PDF from the Apple website.

library(pdftools)

# pdf_text() returns a character vector with one element per page
apple <- pdf_text('apple.pdf')

# Print the raw text of page 26
apple[[26]]

The issue is that if we examine the 26th page, each line terminates with a '\n'. This is no different from the break between the end of the first paragraph (in italics) and the Overview and Highlights paragraph. In the PDF, it does appear that two lines are skipped, but the object in R doesn't reflect that.

I cannot figure out whether this is specific to this particular package or whether the conversion to text itself eliminates these paragraph markers. I haven't been able to set up the import using other methods (e.g., the tm package).

treysp · Accepted Answer

I think it's an underlying property of the document (not of the general text conversion process or of pdftools).

If you use your mouse to select text across paragraph breaks, the selection doesn't pick up the blank lines, suggesting that they are part of the PDF's layout metadata and not the text itself (though I don't actually know anything about PDF file specs).
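If the blank lines really are layout rather than text, word positions might recover them. The sketch below is not something I've run against the Apple PDF; it assumes a data frame of words with `x`, `y`, and `text` columns (the shape `pdftools::pdf_data()` returns per page in pdftools 2.0 or later), treats words sharing a `y` value as one line, and starts a new paragraph wherever the vertical gap is noticeably larger than the typical line spacing. The function name `paragraphs_from_layout` and the `gap_factor` threshold are made up for illustration.

```r
# Hypothetical helper: group words into paragraphs using vertical gaps.
# `words` is assumed to have columns x, y (top of each word box) and text.
paragraphs_from_layout <- function(words, gap_factor = 1.5) {
  if (nrow(words) == 0) return(character(0))

  # Read top to bottom, then left to right
  words <- words[order(words$y, words$x), ]

  # Assumption: words on the same line share exactly the same y value;
  # real PDFs may need a small tolerance here
  line_y <- sort(unique(words$y))
  line_text <- vapply(
    line_y,
    function(y) paste(words$text[words$y == y], collapse = " "),
    character(1)
  )

  # A gap much larger than the median line spacing marks a paragraph break
  gaps <- diff(line_y)
  new_para <- c(TRUE, gaps > gap_factor * stats::median(gaps))

  group <- cumsum(new_para)
  unname(vapply(split(line_text, group), paste, character(1), collapse = " "))
}

# Possible usage, assuming the file from the question:
# words <- pdftools::pdf_data("apple.pdf")[[26]]
# paragraphs_from_layout(words)
```

The median is used so that a handful of large gaps don't distort what counts as normal line spacing; `gap_factor` would likely need tuning per document.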

Your best bet may be coming up with heuristic rule-sets to identify paragraph breaks. I'm thinking something like:

The previous line ends with a period, then
The paragraph title line is short and ends without a period, then
The first sentence of the paragraph starts with a capital letter and takes up the full line
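A rough sketch of that rule set, working on the text of a single page as returned by `pdf_text()`. The function names (`find_title_lines`, `split_paragraphs`) and the width thresholds (`short_frac`, `full_frac`) are placeholders I picked for illustration, not anything provided by pdftools, and they would need tuning for a real document.

```r
# Hypothetical helper: indices of lines that look like paragraph titles.
find_title_lines <- function(lines, short_frac = 0.6, full_frac = 0.9) {
  n <- length(lines)
  if (n < 3) return(integer(0))
  width <- nchar(lines)
  max_width <- max(width)
  i <- 2:(n - 1)  # a title needs a line before it and a line after it

  # Rule 1: the previous line ends with a period
  prev_ends_period <- grepl("\\.$", lines[i - 1])
  # Rule 2: the title line is short and does not end with a period
  is_short_title <- width[i] <= short_frac * max_width &
    !grepl("\\.$", lines[i])
  # Rule 3: the next line starts with a capital letter and is near full width
  next_is_full_sentence <- grepl("^[A-Z]", lines[i + 1]) &
    width[i + 1] >= full_frac * max_width

  i[prev_ends_period & is_short_title & next_is_full_sentence]
}

# Hypothetical helper: split one page of text into paragraphs,
# starting a new paragraph at each detected title line.
split_paragraphs <- function(page_text, ...) {
  lines <- trimws(strsplit(page_text, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]
  titles <- find_title_lines(lines, ...)
  group <- cumsum(seq_along(lines) %in% titles)
  unname(vapply(split(lines, group), paste, character(1), collapse = " "))
}

# Possible usage with the object from the question:
# split_paragraphs(apple[[26]])
```

This only catches paragraphs introduced by a title line, as in the rules above; paragraphs separated by whitespace alone would still run together.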
