Reputation: 545
I'm using R to pull data from PDFs and so far it has been going well. I just opened up a new batch of PDFs and saw that I have to figure out how to account for empty cells. I haven't found a way to do this, and I have hundreds of pages that I need to go through.
I've included some sample data. I haven't found a way to attach the PDFs here, and these are not posted on the web anywhere. I saved df
as a CSV, then copied and pasted that into a word document which I saved as a CSV for this example. Screenshot attached as well.
library(pdftools)
library(tidyverse)
# Example data
df <- data.frame("rows" = c("row1", "row2", "row3", "row4", "row5", "row6", "row7", "row8", "row9", "row10"),
"col1" = c(1, 2, "", 4, 5, 6, 7, 8, 9, 10),
"col2" = c(1, 2, 3, 4, "", "", 7, 8, 9, ""),
"col3" = c(1, 2, "", 4, 5, 6, 7, 8, 9, 10),
"col4" = c(1, 2, 3, 4, 5, 6, 7, "", 9, 10),
"col5" = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
"col6" = c(1, 2, "", "", 5, 6, 7, "", 9, 10),
"col7" = c(1, 2, 3, 4, 5, "", 7, 8, 9, 10),
"col8" = c(1, "", 3, 4, 5, 6, 7, "", 9, 10),
"col9" = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
# Save example data, then save as a PDF outside of R.
# write_csv(df, "sample_data.csv")
# read in the PDF
pdf_file <- pdf_text("sample_data.pdf")
data <- pdf_file[1]
data <- trimws(data)
data <- strsplit(data, "\r\n")
data <- data[[1]]
data <- str_split_fixed(data, " {2,}", 10) ## I think this is the step that needs to change
data <- data.frame(data, stringsAsFactors = FALSE)
# Print out outs of the data for reference.
> data
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 rows col1 col2 col3 col4 col5 col6 col7 col8 col9
2 row1 1 1 1 1 1 1 1 1 1
3 row2 2 2 2 2 2 2 2 2
4 row3 3 3 3 3 3 3
5 row4 4 4 4 4 4 4 4 4
6 row5 5 5 5 5 5 5 5 5
7 row6 6 6 6 6 6 6 6
8 row7 7 7 7 7 7 7 7 7 7
9 row8 8 8 8 8 8 8
10 row9 9 9 9 9 9 9 9 9 9
11 row10 10 10 10 10 10 10 10 10
df
rows col1 col2 col3 col4 col5 col6 col7 col8 col9
1 row1 1 1 1 1 1 1 1 1 1
2 row2 2 2 2 2 2 2 2 2
3 row3 3 3 3 3 3 3
4 row4 4 4 4 4 4 4 4 4
5 row5 5 5 5 5 5 5 5 5
6 row6 6 6 6 6 6 6 6
7 row7 7 7 7 7 7 7 7 7 7
8 row8 8 8 8 8 8 8
9 row9 9 9 9 9 9 9 9 9 9
10 row10 10 10 10 10 10 10 10 10
UPDATE: Adding dput(pdf_file)
> dput(pdf_file)
"rows col1 col2 col3 col4 col5 col6 col7 col8 col9\r\nrow1 1 1 1 1 1 1 1 1 1\r\nrow2 2 2 2 2 2 2 2 2\r\nrow3 3 3 3 3 3 3\r\nrow4 4 4 4 4 4 4 4 4\r\nrow5 5 5 5 5 5 5 5 5\r\nrow6 6 6 6 6 6 6 6\r\nrow7 7 7 7 7 7 7 7 7 7\r\nrow8 8 8 8 8 8 8\r\nrow9 9 9 9 9 9 9 9 9 9\r\nrow10 10 10 10 10 10 10 10 10\r\n"
You can see that there is a difference between df
and data
at this point. I've tried playing around with a few things and I haven't been able to make anything work well enough to post here. I tried using some if/else logic to say that if there were 3 or more spaces, insert NA, but that just caused a bunch of errors so I abandoned that approach. My goal is to get the data as close to df as possible.
Upvotes: 3
Views: 1196
Reputation: 389215
Try using read.fwf
as a fixed-width file.
data <- pdf_file[1]
data <- trimws(data)
data <- strsplit(data, "\r\n")
data <- data[[1]]
writeLines(data, 'temp.txt')
result <- read.fwf('temp.txt', c(11, 2, rep(8, 8)), skip = 1, strip.white = TRUE)
names(result) <- scan(text = readLines('temp.txt', n = 1), what = character())
result
# rows col1 col2 col3 col4 col5 col6 col7 col8 col9
#1 row1 1 1 1 1 1 1 1 1 1
#2 row2 2 2 2 2 2 2 2 NA 2
#3 row3 NA 3 NA 3 3 NA 3 3 3
#4 row4 4 4 4 4 4 NA 4 4 4
#5 row5 5 NA 5 5 5 5 5 5 5
#6 row6 6 NA 6 6 6 6 NA 6 6
#7 row7 7 7 7 7 7 7 7 7 7
#8 row8 8 8 8 NA 8 NA 8 NA 8
#9 row9 9 9 9 9 9 9 9 9 9
#10 row10 10 NA 10 10 10 10 10 10 10
Upvotes: 0
Reputation: 170
This looks like a good scenario to use the tabulizer
package. It works really well when there are nicely formatted tables like this in the PDF. See the vignette. The best function here for you would be tabulizer::extract_tables
. It should also recognize the blank spaces as empty values assuming the PDFs are all well formatted like this.
Upvotes: 0