MCP_infiltrator

Reputation: 4179

Add file name as a column to a data.frame inside of a loop

I have PDFs that I am reading into R and converting into data frames using tabulizer::extract_tables.

Each PDF has 6 columns/variables and can span multiple pages per document; that part works. What I want to do is add a 7th column holding the file name inside my for loop, but it fails with this error:

Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match

Here is my code:

for(i in 1:length(pdf.list)){
  print(paste("Reading - ", pdf.list[i]))
  cur.doc <- extract_tables(pdf.list[i])
  for(j in 1:length(cur.doc)){
    cur.doc.page <- cur.doc[[j]]
    df$FileName = pdf.list[i]
    df <- as.data.frame(cur.doc.page)
    documents <- rbind(documents, df)
  }
}

So I gather the issue is in my rbind() call, but I am not sure a) why it fails or b) how to fix it. pdf.list[i] gives the current file name.

UPDATE

This finally did it, with all errors handled:

documents <- data.frame()
error.page.df <- data.frame()

for(i in 1:length(pdf.list)){
  print(paste("Reading file -", pdf.list[i]))
  cur.doc <- extract_tables(pdf.list[i])
  print(paste("There are", length(cur.doc), "pages in the current file."))
  for(j in 1:length(cur.doc)){
    cur.doc.page <- cur.doc[j]
    print(
      paste(
        "Reading page -"
        , j
        , "There are"
        , ncol(as.data.frame(cur.doc.page))
        , "columns."
        )
      )
    df <- as.data.frame(cur.doc.page)
    df <- df[-1, ]
    df <- df[, colSums(df != "") != 0]
    df$FileName <- pdf.list[i]
    tmp.col.names <- c(
      "V1","V2","V3","V4","V6","FileName"
    )
    try(colnames(df) <- tmp.col.names, silent = TRUE)
    possible.error <- try(rbind(documents, df))
    if(inherits(possible.error, "try-error")) {
      print(
        paste(
          "Could not insert page"
          , j
          , "for file -"
          , pdf.list[i]
        )
      )
      error.msg <- paste(
        "Could not insert page"
        , j
        , "for file -"
        , pdf.list[i]
      )
      error.page.df <- rbind(error.page.df, error.msg)
      next 
    } else {
      documents <-rbind(documents, df)
      possible.error <- NA
    }
  }
}
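
For reference, the same "skip the page and record the failure" idea can be sketched with tryCatch() so that rbind() runs only once per page. This is a minimal sketch on toy data, not the code above: safe_rbind, good.page and bad.page are hypothetical names made up for illustration, and no PDFs are read.

```r
# Hypothetical helper: returns the combined data frame, or NULL if rbind() fails
safe_rbind <- function(acc, df) {
  tryCatch(rbind(acc, df), error = function(e) NULL)
}

# Toy stand-ins for the accumulated result and two extracted pages
documents <- data.frame(V1 = "a", FileName = "one.pdf")
good.page <- data.frame(V1 = "b", FileName = "one.pdf")
bad.page  <- data.frame(V1 = "c", V2 = "d", FileName = "one.pdf")  # extra column

errors <- character(0)
for (page in list(good.page, bad.page)) {
  combined <- safe_rbind(documents, page)
  if (is.null(combined)) {
    # Record the failure and move on to the next page
    errors <- c(errors, "Could not insert page")
  } else {
    documents <- combined
  }
}
```

Here the page with the extra column is skipped and logged, while the matching page is appended.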

Upvotes: 0

Views: 563

Answers (1)

Gregor Thomas

Reputation: 145755

It's hard to tell... I think you may have more bugs. Maybe for (j in 1:length(cur.doc)), not 1:length(cur.doc[i]). And you create df but never use it... do you mean documents <- rbind(documents, df) rather than rbind(documents, cur.doc.page)?

Either way, I think you want to add the new column just to the current doc, not to the entire documents data frame. The way it is coded now, you are adding a whole new column to documents every time through the inner loop, but rbind requires both data frames to have the same number of columns.
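
The column-count requirement can be seen in isolation with a small sketch. The data frames here are toy stand-ins (the names page, first.pdf and second.pdf are made up for illustration), not output from extract_tables:

```r
# Toy data standing in for the accumulated result and one extracted page
documents <- data.frame(V1 = "a", V2 = "b", FileName = "first.pdf")
page <- data.frame(V1 = "c", V2 = "d")

# Binding fails: 'documents' has 3 columns, 'page' has only 2
result <- try(rbind(documents, page), silent = TRUE)
failed <- inherits(result, "try-error")

# Add the column to the page first, then the bind succeeds
page$FileName <- "second.pdf"
documents <- rbind(documents, page)
```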

I assume you want to use df, so add the column to df before binding it onto documents:

df$filename = pdf.list[i]

(You use pdf.list[j] in your code, but it seems like it should be [i] as in your text).

Like this:

documents <- data.frame()
for(i in 1:length(pdf.list)){
  print(paste("Reading - ", pdf.list[i]))
  cur.doc <- extract_tables(pdf.list[i])
  for(j in 1:length(cur.doc)){
    cur.doc.page <- cur.doc[[j]]
    df <- as.data.frame(cur.doc.page)
    df$FileName <- pdf.list[i]
    documents <- rbind(documents, df)
  }
}
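
As a possible alternative to growing documents inside the loop, the pages can be collected in a list and bound once at the end. The sketch below assumes each page comes back as a character matrix; fake_extract_tables is a hypothetical stand-in for extract_tables so the example runs without any PDFs, and the file names are made up.

```r
# Hypothetical stand-in for tabulizer::extract_tables(): returns a list of
# character matrices, one per page. Swap in the real call when reading PDFs.
fake_extract_tables <- function(file) {
  list(
    matrix(c("a", "b", "c", "d"), nrow = 2),
    matrix(c("e", "f"), nrow = 1)
  )
}

pdf.list <- c("one.pdf", "two.pdf")  # hypothetical file names

# Build one data frame per file, tagging every page with its file name
per.file <- lapply(pdf.list, function(f) {
  pages <- lapply(fake_extract_tables(f), function(page) {
    df <- as.data.frame(page, stringsAsFactors = FALSE)
    df$FileName <- f
    df
  })
  do.call(rbind, pages)
})

# Bind all files together in one step instead of growing inside the loop
documents <- do.call(rbind, per.file)
```

This only works if every page has the same columns; pages with a different layout would still need the kind of checking shown in the question's update.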

Upvotes: 2
