Reputation: 4179
I have PDFs that I am reading into R. I am converting them into data frames using tabulizer::extract_tables.
The PDF files have 6 columns/variables and can have multiple pages per document...fine, got that working. What I want to do is add a 7th column for the file name inside my for loop, but it fails with this error:
Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match
Here is my code:
for(i in 1:length(pdf.list)){
  print(paste("Reading - ", pdf.list[i]))
  cur.doc <- extract_tables(pdf.list[i])
  for(j in 1:length(cur.doc)){
    cur.doc.page <- cur.doc[[j]]
    df$FileName = pdf.list[i]
    df <- as.data.frame(cur.doc.page)
    documents <- rbind(documents, df)
  }
}
So I get that the issue is my cbind(), but I am not sure a) why it happens or b) how to fix it. pdf.list[i] gives the current file name.
UPDATE
This finally did it, with error handling:
documents <- data.frame()
error.page.df <- data.frame()
for(i in 1:length(pdf.list)){
  print(paste("Reading file -", pdf.list[i]))
  cur.doc <- extract_tables(pdf.list[i])
  print(paste("There are", length(cur.doc), "pages in the current file."))
  for(j in 1:length(cur.doc)){
    cur.doc.page <- cur.doc[j]
    print(
      paste(
        "Reading page -"
        , j
        , "There are"
        , ncol(as.data.frame(cur.doc.page))
        , "columns."
      )
    )
    df <- as.data.frame(cur.doc.page)
    df <- df[-1, ]
    df <- df[, colSums(df != "") != 0]
    df$FileName <- pdf.list[i]
    tmp.col.names <- c(
      "V1","V2","V3","V4","V6","FileName"
    )
    try(colnames(df) <- tmp.col.names, silent = TRUE)
    possible.error <- try(rbind(documents, df))
    if(isTRUE(class(possible.error)=="try-error")) {
      print(
        paste(
          "Could not insert page"
          , j
          , "for file -"
          , pdf.list[i]
        )
      )
      error.msg <- paste(
        "Could not insert page"
        , j
        , "for file -"
        , pdf.list[i]
      )
      error.page.df <- rbind(error.page.df, error.msg)
      next
    } else {
      documents <- rbind(documents, df)
      possible.error <- NA
    }
  }
}
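The loop above calls rbind twice per page: once inside try() to test it and once more to keep the result. A minimal sketch of doing it in a single call with tryCatch follows, using two made-up data frames (bad.page and result are illustrative names, not part of the original code):

```r
# Toy data: the running result has 2 columns, the new page has 3,
# so binding them is expected to fail.
documents <- data.frame(V1 = "a", FileName = "one.pdf", stringsAsFactors = FALSE)
bad.page  <- data.frame(V1 = "b", V2 = "c", FileName = "two.pdf", stringsAsFactors = FALSE)

# tryCatch returns the bound data frame on success and NULL on error,
# so rbind is evaluated only once.
result <- tryCatch(rbind(documents, bad.page), error = function(e) NULL)

if (is.null(result)) {
  message("Could not insert page")   # documents is left unchanged
} else {
  documents <- result
}
```

When only a yes/no answer is needed from try(), inherits(x, "try-error") is the usual way to test for failure.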
Upvotes: 0
Views: 563
Reputation: 145755
It's hard to tell... I think you may have more bugs. Maybe for (j in 1:length(cur.doc)), not 1:length(cur.doc[i]). And you create df but never use it... do you mean documents <- rbind(documents, df) rather than rbind(documents, cur.doc.page)?
Either way, I think you want to add the new column just to the current doc, not to the entire documents data frame. The way it is coded now, you are adding a whole new column to documents every time through the inner loop. But rbind requires both data frames to have the same number of columns.
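A minimal sketch of why the error appears, using two made-up data frames (page and docs are illustrative names, not from the question): rbind on data frames refuses inputs whose column counts differ, and it works once the new column is added to the page instead.

```r
# Toy data frames standing in for one extracted page and the running result.
page <- data.frame(V1 = c("a", "b"), V2 = c("c", "d"), stringsAsFactors = FALSE)
docs <- data.frame(V1 = "x", V2 = "y", stringsAsFactors = FALSE)

# Adding the column to the accumulated result, not to the new page,
# leaves the two inputs with different widths (3 columns vs 2).
docs$FileName <- "first.pdf"

res <- try(rbind(docs, page), silent = TRUE)
failed <- inherits(res, "try-error")

# Adding the column to the page first makes the widths match.
page$FileName <- "second.pdf"
combined <- rbind(docs, page)
```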
I assume you want to use df, so add the column to df before binding onto documents:
df$filename = pdf.list[i]
(You use pdf.list[j] in your code, but it seems like it should be [i] as in your text).
Like this:
documents <- data.frame()
for(i in 1:length(pdf.list)){
  print(paste("Reading - ", pdf.list[i]))
  cur.doc <- extract_tables(pdf.list[i])
  for(j in 1:length(cur.doc)){
    cur.doc.page <- cur.doc[[j]]
    df <- as.data.frame(cur.doc.page)
    df$FileName <- pdf.list[i]
    documents <- rbind(documents, df)
  }
}
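A variant of the loop above that runs without any PDF files may help for testing. It is only a sketch: fake_extract_tables is a hypothetical stand-in for extract_tables, the file names are made up, and the pages are collected in a list and bound once at the end rather than growing documents on every iteration.

```r
# Hypothetical stand-in for tabulizer::extract_tables(): returns one
# character matrix per page so the sketch runs without real PDFs.
fake_extract_tables <- function(file) {
  list(
    matrix(c("a", "b", "c", "d"), nrow = 2),
    matrix(c("e", "f", "g", "h"), nrow = 2)
  )
}

pdf.list <- c("one.pdf", "two.pdf")   # made-up file names

pages <- list()
for (i in seq_along(pdf.list)) {
  cur.doc <- fake_extract_tables(pdf.list[i])
  for (j in seq_along(cur.doc)) {
    df <- as.data.frame(cur.doc[[j]], stringsAsFactors = FALSE)
    df$FileName <- pdf.list[i]        # tag the page before binding
    pages[[length(pages) + 1]] <- df
  }
}

# Bind once at the end; every element has the same columns.
documents <- do.call(rbind, pages)
```

seq_along() is used instead of 1:length() because 1:length(x) yields c(1, 0) when x is empty, which would run the loop body on elements that do not exist.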
Upvotes: 2