JorisK

Reputation: 11

Extract text from multiple PDF-files to a structured data table

I am new to this platform and I hope someone can help me.

I have imported some PDF files into RStudio using the pdftools library. Now I want to split this text into structured columns, but I can't seem to get the structure right.

This is an example of one of the files I imported. I want to put the yellow-shaded lines into a data table.

[Image: example PDF page with the relevant lines shaded yellow]

This is the outcome I would ultimately like to have.

[Image: desired output table]

So far I have written the code below, but I can't get the result into a data table.

library(pdftools)
library(stringr)
library(dplyr)

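# list the PDF files in the working directory
files <- list.files(pattern = "\\.pdf$", full.names = TRUE)

# read the text of each PDF: one character vector per file, one element per page
filestext <- lapply(files, pdf_text)

# split the text at "\n"
# (filestext is a list here, not the character vector that str_split() expects)
filestext <- str_split(filestext, pattern = "\n")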

This is the result I get:

[Image: screenshot of the resulting output]

Does anyone know the easiest way to solve this?
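
For reference, here is a minimal sketch of one way to get from the per-file text to a table with one row per text line. It assumes that `filestext` still holds the direct output of `lapply(files, pdf_text)`, i.e. a list with one character vector per file and one element per page. The helper name `pdf_lines_to_table` and the sample text are hypothetical, and the sketch uses base R only so that it runs without any PDF files. Picking out the yellow-shaded lines and splitting them into columns would be a further step that depends on the actual layout of the documents.

```r
# Hypothetical helper: turn pdf_text() output into one row per text line.
# `files`     : character vector of file paths
# `filestext` : list with one character vector per file (one element per page),
#               as returned by lapply(files, pdftools::pdf_text)
pdf_lines_to_table <- function(files, filestext) {
  per_file <- Map(function(file, pages) {
    # split every page into its individual lines
    page_lines <- strsplit(pages, "\n", fixed = TRUE)
    n_lines <- lengths(page_lines)
    data.frame(
      file = rep(basename(file), sum(n_lines)),
      page = rep(seq_along(page_lines), n_lines),
      text = trimws(as.character(unlist(page_lines))),
      stringsAsFactors = FALSE
    )
  }, files, filestext)

  # stack the per-file tables and drop empty lines
  out <- do.call(rbind, unname(per_file))
  out <- out[nzchar(out$text), ]
  rownames(out) <- NULL
  out
}

# Stand-in for real pdf_text() output (made-up content, not from the PDFs above)
files <- c("./a.pdf", "./b.pdf")
filestext <- list(
  c("Name: A\nTotal: 10\n", "Page two\n"),
  "Name: B\n\nTotal: 20\n"
)

lines_tbl <- pdf_lines_to_table(files, filestext)
print(lines_tbl)
```

Once the text is in this long format, the rows of interest can be selected with a pattern match on the `text` column (for example with `grepl()` or `stringr::str_detect()`) and then split into separate columns.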

Upvotes: 1

Views: 1005

Answers (1)

Michael Schultz

Reputation: 130

I would also give https://sensible.so a shot. We have some great documentation and a free plan just for projects like this. Plus, when you sign up there are some tutorials to help you understand how to extract different types of data. I bet you can have this extracted into a clean JSON object in no time.

Upvotes: -3
