Yanis

Reputation: 1

Trying to systematically extract information (text + table) from a PDF in R

For a project, I need to extract information from a PDF file that is not available anywhere else. Since there are thousands of entries, I do not want to type them in manually, which would be error-prone.

The PDF is structured like this

What I want to do is map each variable name (at the beginning of each entry) to its categories. The variable names do not follow any consistent pattern, so a regular expression or similar approach cannot be used to identify them.

This is the ideal dataset I would like to have:

     variable_name         variable_description   input_variable
1 VARIABLE_NUMBER1 VARIABLE_DESCRIPTION_NUMBER1 VARIABLE_NUMBER2
2 VARIABLE_NUMBER3 VARIABLE_DESCRIPTION_NUMBER3 VARIABLE_NUMBER4
                                                                           list_categories
1                                              1 = CATEGORY1; 2 = CATEGORY2; 3 = CATEGORY3
2 1 = CATEGORY1; 2 = CATEGORY2; 3 = CATEGORY3; 4 = CATEGORY4; 5 = CATEGORY5; 6 = CATEGORY6

In my PDF, this format is consistent across a certain subset of pages, which I would otherwise have to type in manually. As you can guess, this PDF is a documentation file, so there is an introductory text that does not interest me, and then, from say page 46 to 110, it is just a list of entries like the one in the image I gave.

So far, I have only managed to extract the tables using the tabulizer package, but not the metadata above each table, as I'm familiar with neither R nor this package. The package itself works fine, except that R imports the first row as a header, but this is easily fixable.

What I thought about doing was to import (A) the tables as a whole and (B) the whole PDF as text, and then perform a text match, but I'm worried this would yield inconsistencies.

The only information that interests me so far is the variable name and its categories, so it's not a big deal if the other information is not available, but the more info I can extract the better.

Upvotes: 0

Views: 125

Answers (1)

K J

Reputation: 11939

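Since R's pdftools is based on poppler, I would recommend gathering your data by calling poppler's pdftotext directly from the shell, so that the problem is split into two parts for secondary parsing. In the commands below, -x and -y give the top-left corner of a crop area and -W and -H its width and height, so each call extracts one horizontal band of the page. The trailing - sends the text to stdout, which is then redirected (>) or appended (>>) to a file: the metadata bands go to gathers1.txt and the category bands to gathers2.txt.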

poppler\22.02>pdftotext -nopgbrk -layout -x 0 -W 900 -y 0 -H 100 5aVh4.pdf ->gathers1.txt
poppler\22.02>pdftotext -nopgbrk -layout -x 0 -W 900 -y 250 -H 100 5aVh4.pdf ->>gathers1.txt
poppler\22.02>pdftotext -nopgbrk -layout -x 0 -W 900 -y 170 -H 100 5aVh4.pdf ->gathers2.txt
poppler\22.02>pdftotext -nopgbrk -layout -x 0 -W 900 -y 420 -H 150 5aVh4.pdf ->>gathers2.txt

poppler\22.02>type gathers1.txt

VARIABLE_NUMBER1
VARIABLE DESCRIPTION_NUMBER1
Input variable = VARIABLE_NUMBER2

VARIABLE NUMBER3
VARIABLE_DESCRIPTION_NUMBER3
Input variable = VARIABLE_NUMBER4

poppler\22.02>type gathers2.txt

01   CATEGORY 1
02   CATEGORY 2
03   CATEGORY 3
01   CATEGORY 1
02   CATEGORY 2
03   CATEGORY 3
04   CATEGORY 4
05   CATEGORY 5
06   CATEGORY 6

poppler\22.02>
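
If you would rather stay inside R, the same calls could be driven with system2(). This is only a sketch: band_args and read_band are hypothetical helper names, the x = 0 and W = 900 values are copied from my commands above, pdftotext is assumed to be on your PATH, and the -f/-l page range (46 to 110) is taken from your description. As far as I know the crop area is applied to every page in the range, so please check that against your file.

```r
# Hypothetical helper: build the pdftotext arguments for one horizontal band.
# y is the top edge of the crop and h its height; x = 0 and W = 900 are
# copied from the shell commands above and are assumptions about your layout.
band_args <- function(pdf, y, h, first = NULL, last = NULL) {
  args <- c("-nopgbrk", "-layout",
            "-x", "0", "-W", "900",
            "-y", as.character(as.integer(y)),
            "-H", as.character(as.integer(h)))
  # Optional page range (-f = first page, -l = last page).
  if (!is.null(first)) args <- c(args, "-f", as.character(as.integer(first)))
  if (!is.null(last))  args <- c(args, "-l", as.character(as.integer(last)))
  # Quote the file name for the shell; "-" sends the text to stdout.
  c(args, shQuote(pdf), "-")
}

# Hypothetical helper: run pdftotext for one band and return the lines
# it prints as a character vector (requires pdftotext on the PATH).
read_band <- function(pdf, y, h, ...) {
  system2("pdftotext", band_args(pdf, y, h, ...), stdout = TRUE)
}
```

For example, c(read_band("5aVh4.pdf", 0, 100), read_band("5aVh4.pdf", 250, 100)) would collect the same lines as gathers1.txt without writing a file.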

The output is then easier to manipulate as two separate text layouts. Clearly, my values may differ from yours, as this is an emulation of your unseen PDF.
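
For the secondary parsing, a minimal base R sketch could look like the following. It rests on two assumptions that you would need to verify on the real file: every entry contributes exactly three non-blank metadata lines (name, description, input variable), and every variable's category codes restart at 01. parse_entries is a hypothetical name, and the sample lines are copied from my output above.

```r
# Sample lines copied from the gathers1.txt / gathers2.txt output above.
meta_lines <- c("", "VARIABLE_NUMBER1", "VARIABLE DESCRIPTION_NUMBER1",
                "Input variable = VARIABLE_NUMBER2", "",
                "VARIABLE NUMBER3", "VARIABLE_DESCRIPTION_NUMBER3",
                "Input variable = VARIABLE_NUMBER4")
cat_lines <- c("01   CATEGORY 1", "02   CATEGORY 2", "03   CATEGORY 3",
               "01   CATEGORY 1", "02   CATEGORY 2", "03   CATEGORY 3",
               "04   CATEGORY 4", "05   CATEGORY 5", "06   CATEGORY 6")

parse_entries <- function(meta_lines, cat_lines) {
  # Assumption: each entry has exactly three non-blank metadata lines.
  meta <- trimws(meta_lines)
  meta <- meta[nzchar(meta)]
  stopifnot(length(meta) %% 3 == 0)
  m <- matrix(meta, ncol = 3, byrow = TRUE)

  # Keep only lines of the form "<code> <label>".
  cats  <- trimws(cat_lines)
  cats  <- cats[grepl("^[0-9]+[[:space:]]+", cats)]
  code  <- as.integer(sub("^([0-9]+)[[:space:]]+.*$", "\\1", cats))
  label <- sub("^[0-9]+[[:space:]]+", "", cats)

  # Assumption: a code of 1 marks the start of the next variable's list.
  group <- cumsum(code == 1)
  stopifnot(max(group) == nrow(m))
  lists <- tapply(paste(code, label, sep = " = "), group,
                  paste, collapse = "; ")

  data.frame(
    variable_name        = m[, 1],
    variable_description = m[, 2],
    input_variable       = sub("^Input variable[[:space:]]*=[[:space:]]*",
                               "", m[, 3]),
    list_categories      = as.vector(lists),
    stringsAsFactors     = FALSE
  )
}

entries <- parse_entries(meta_lines, cat_lines)
```

On your own files you would replace the sample vectors with readLines("gathers1.txt") and readLines("gathers2.txt"). The two stopifnot() checks are there so that a page which breaks either assumption fails loudly instead of silently misaligning names and categories.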

Upvotes: 0
