Carlo

Reputation: 397

Skip lines/rows which create errors in fread R

I am trying to read a big file into R, but the error below occurs. It doesn't disappear even when I skip the first 800607 lines. I also tried deleting the line in the terminal with this command:

sed '800608d' filename.csv

That didn't solve my problem. I would really appreciate your help.

The original error I got from R is:

> data<-fread("filename.csv")
Read 2.0% of 34143409 rows
Error in fread("filename.csv") : 
Field 16 on line 800607 starts with quote (") but then has a problem. It can contain balanced unescaped quoted subregions but if it does it can't contain embedded \n as well. Check for unbalanced unescaped quotes: """The attorney for Martin's family, Benjamin Crump, says the evidence is ""irrelevant\"""" """".","NULL","NULL","NULL","NULL","NULL","NULL","NULL","Negative"
In addition: Warning message:
 In fread("filename.csv") :
Starting data input on line 8 and discarded previous non-empty line: done

Upvotes: 3

Views: 2345

Answers (1)

mjfred

Reputation: 54

I'm currently in the middle of resolving this kind of issue myself. I'm not sure this will work in all cases, let alone for all of the files I'm working with, but for now I'm having some success with:

library(data.table)

skip.list <- c()  # files that still fail after one retry

for (file in dir(input.dir)) {
  path <- file.path(input.dir, file)
  ingested.file <- try(fread(path, header=TRUE, stringsAsFactors=FALSE), silent=TRUE)
  if (inherits(ingested.file, "try-error")) {
    # Pull the line number out of an error message of the form "... but line N ..."
    error.line <- as.integer(sub(" .*","",sub(".*but line ","",as.character(ingested.file))))
    # Retry, starting after the reported line (everything up to it is dropped)
    ingested.file <- try(fread(path, header=TRUE, stringsAsFactors=FALSE, skip=error.line), silent=TRUE)
    if (inherits(ingested.file, "try-error")) {
      # Still failing: record the file and move on to the next one
      skip.list <- c(skip.list, file)
      next
    }
  }
}

I'm currently working with about 750 files of 1000 rows each, about 50 of which have the same issue. With this method I am able to read in 30 of those 50; the remaining 20 seem to have errors in multiple rows, but I am unable to specify multiple skip values.

If it were possible to specify more skips, you could try a while loop, i.e.

while (inherits(ingested.file, "try-error")) ... and then update error.line automatically as many times as necessary.
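
Here is a rough sketch of that retry idea. It uses a different technique from the skip approach above: the file is read with readLines(), the line named in the error is dropped, and the remaining lines are parsed again. The helper names extract.error.line and read.dropping.bad.lines are made up for illustration. The default reader assumes a data.table version whose fread() accepts the text argument, and the whole thing only works if the error message actually contains "line N", which depends on the data.table version and the kind of error. I have not tried it on files with embedded newlines inside quoted fields.

```r
# Hypothetical helper: pull the first "line <N>" out of an error message.
# Returns NA if the message does not mention a line number.
extract.error.line <- function(msg) {
  hit <- regmatches(msg, regexpr("line [0-9]+", msg))
  if (length(hit) == 0) return(NA_integer_)
  as.integer(sub("line ", "", hit))
}

# Hypothetical helper: retry the read, dropping one reported line per failure.
# `reader` takes a character vector of lines; the default is an assumption
# (fread() with the `text` argument) and is only evaluated if it is used.
read.dropping.bad.lines <- function(lines,
                                    reader = function(x) data.table::fread(text = x),
                                    max.tries = 100) {
  dropped <- character(0)
  for (attempt in seq_len(max.tries)) {
    result <- try(reader(lines), silent = TRUE)
    if (!inherits(result, "try-error")) {
      # Success: return the parsed data plus the lines that were thrown away
      return(list(data = result, dropped = dropped))
    }
    bad <- extract.error.line(as.character(result))
    if (is.na(bad) || bad < 1 || bad > length(lines)) {
      stop("could not locate the failing line: ", as.character(result))
    }
    # Line numbers in the error refer to the current set of lines
    dropped <- c(dropped, lines[bad])
    lines <- lines[-bad]
  }
  stop("gave up after ", max.tries, " attempts")
}

# Possible usage (file name taken from the question):
# out <- read.dropping.bad.lines(readLines("filename.csv"))
# out$data      # the parsed table
# out$dropped   # the lines that were removed
```

Keeping the dropped lines around makes it possible to inspect them afterwards and decide whether they can be repaired by hand instead of being lost.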

I hope this helps!

Upvotes: 2
