Trying to systematically extract information (text + table) from pdf in R

Question

For a project, I need to extract information from a PDF file that is not available anywhere else. Since there are thousands of data points, I do not want to type them in manually, as that would be error-prone.

The PDF is structured like this

What I want to do is map each variable name (at the beginning of each entry) to its categories. The variable names do not follow any consistent pattern, so there is no way to match them with a regular expression or anything similar.

This is the ideal dataset I would like to have:

     variable_name         variable_description   input_variable
1 VARIABLE_NUMBER1 VARIABLE_DESCRIPTION_NUMBER1 VARIABLE_NUMBER2
2 VARIABLE_NUMBER3 VARIABLE_DESCRIPTION_NUMBER3 VARIABLE_NUMBER4
                                                                           list_categories
1                                              1 = CATEGORY1; 2 = CATEGORY2; 3 = CATEGORY3
2 1 = CATEGORY1; 2 = CATEGORY2; 3 = CATEGORY3; 4 = CATEGORY4; 5 = CATEGORY5; 6 = CATEGORY6

In my PDF, this format is consistent across a certain subset of pages, which I would otherwise have to type in manually. As you can guess, this PDF is a documentation file, so there is an introduction that does not interest me, and then, from say page 46 to 110, it is just a list of entries like the one in the image I gave.

So far, I have only managed to extract the tables using the tabulizer package, but not the metadata above each table, as I am familiar with neither R nor this package. The package itself works fine, apart from the fact that R imports the first row of each table as a header, but this is easily fixable.
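To illustrate the table side, here is a minimal sketch of how one extracted table could be turned into the `list_categories` string from the ideal dataset. It assumes each table arrives as a two-column character matrix (code, label) with no header row; as far as I understand, tabulizer's `extract_tables()` can return matrices via its `output` argument, which would also avoid the first-row-as-header issue. The helper name `collapse_categories` and the sample matrix are made up for illustration.

```r
# Hypothetical helper: collapse a two-column (code, label) table into
# a single "code = label; code = label" string.
collapse_categories <- function(tab) {
  tab    <- as.matrix(tab)          # accept a matrix or a data frame
  codes  <- trimws(tab[, 1])
  labels <- trimws(tab[, 2])
  keep   <- nzchar(codes)           # drop rows with an empty code cell
  paste(paste(codes[keep], labels[keep], sep = " = "), collapse = "; ")
}

# Made-up stand-in for one extracted table (every row is data, no header).
tab <- matrix(c("1", "2", "3",
                "CATEGORY1", "CATEGORY2", "CATEGORY3"),
              ncol = 2)

categories <- collapse_categories(tab)
print(categories)
```

If the real tables have more than two columns or merged cells, the column indices above would need adjusting.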

What I thought about doing was to import (A) the tables as a whole and (B) the whole PDF as text, and then match the two, but I am worried that this would yield inconsistencies.
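One way to sidestep the matching step might be to work from the text alone and rely on position rather than on the variable names. The sketch below assumes a layout I cannot confirm from the PDF: each entry consists of one or more header lines (the first token of the first header line being the variable name), followed by category rows of the form `<integer code> <label>`. The function name `parse_entries` and the sample lines are hypothetical; in practice the lines could come from something like `pdftools::pdf_text()` on pages 46 to 110, split on newlines.

```r
# Hypothetical parser: groups lines into entries by position.
# Assumes category rows start with an integer code and header lines do not.
parse_entries <- function(lines) {
  lines  <- trimws(lines)
  lines  <- lines[nzchar(lines)]                  # drop blank lines
  is_cat <- grepl("^[0-9]+\\s+\\S", lines)        # looks like "<code> <label>"
  # A new entry starts at a header line that follows a category row
  # (or at the very first line).
  starts <- !is_cat & c(TRUE, head(is_cat, -1))
  blocks <- split(seq_along(lines), cumsum(starts))
  rows <- lapply(blocks, function(idx) {
    hdr  <- lines[idx][!is_cat[idx]]
    cats <- lines[idx][is_cat[idx]]
    data.frame(
      variable_name   = sub("\\s.*$", "", hdr[1]),   # first token of first header line
      list_categories = paste(sub("^([0-9]+)\\s+", "\\1 = ", cats),
                              collapse = "; "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Made-up lines imitating two entries.
sample_lines <- c(
  "VARIABLE_NUMBER1 VARIABLE_DESCRIPTION_NUMBER1",
  "Input: VARIABLE_NUMBER2",
  "",
  "1 CATEGORY1", "2 CATEGORY2", "3 CATEGORY3",
  "",
  "VARIABLE_NUMBER3 VARIABLE_DESCRIPTION_NUMBER3",
  "Input: VARIABLE_NUMBER4",
  "1 CATEGORY1", "2 CATEGORY2"
)

entries <- parse_entries(sample_lines)
print(entries)
```

This would break if a header line happens to start with a number, so the result would need spot-checking against the PDF.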

The only information that interests me so far is the variable name and its categories, so it is not a big deal if the other information is not available, but the more I can extract, the better.
