R data.table fread cannot read in irregular column lengths when the larger rows do not appear early in a file

Question

I want to read in a csv file using the data.table function fread. The csv file has a large number of records with 9 columns and then subsequently a large number of records with 10 columns. Using the argument fill=TRUE does not solve the issue. Here is some example data that demonstrates my problem:

library(data.table)
short <- rep("1,1,1", 1000)
long <- rep("1,1,1,2", 1000)
write(c(short, long), "shortLong.csv")
write(c(long, short), "longShort.csv")

When I read in the file that has the short column lengths and then the long column lengths, I get this error:

fread("shortLong.csv", fill=TRUE)
Error in fread("shortLong.csv", fill = TRUE) : 
  Expecting 3 cols, but line 1001 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '
' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved

However, when I read the file that has the long column lengths and then the short column lengths, it reads in the file with no issue, and fills in the missing values in the short columns with NA, which is what I want:

fread("longShort.csv", fill=TRUE)
      V1 V2 V3 V4
   1:  1  1  1  2
   2:  1  1  1  2
   3:  1  1  1  2
   4:  1  1  1  2
   5:  1  1  1  2
  ---            
1996:  1  1  1 NA
1997:  1  1  1 NA
1998:  1  1  1 NA
1999:  1  1  1 NA
2000:  1  1  1 NA

This error seems to be caused by the large number of short columns before the long columns, as when I mixed the short and long together there was no issue:

mixed <- rep(c("1,1,1", "1,1,1,2"), 1000)
write(mixed, "mixed.csv")
fread("mixed.csv", fill=TRUE)
      V1 V2 V3 V4
   1:  1  1  1 NA
   2:  1  1  1  2
   3:  1  1  1 NA
   4:  1  1  1  2
   5:  1  1  1 NA
  ---            
1996:  1  1  1  2
1997:  1  1  1 NA
1998:  1  1  1  2
1999:  1  1  1 NA
2000:  1  1  1  2

I suspect this difference in behavior may be because fread looks ahead to allocate columns but does not look through the whole file and sets a maximum column number based on the longest seen (I am not very familiar with the inner workings of the function however).

Is there any way to read in my data correctly using fread? I think I could probably put some hacky solution together and bypass fread, but I love the performance and I don't want to drastically slow down my code. Thanks!

G. Grothendieck · Accepted Answer

Note that what you have is not a csv file since it has no header. If we add a header it will work. First use fread to read it in as a single field per line giving the character vector Lines. From that compute the maximum number of fields n. Finally re-read Lines after prefixing it with a header.

Lines <- fread("shortLong.csv", sep = "")[[1]]
n <- max(count.fields(textConnection(Lines), sep = ","))
fread(text = c(toString(1:n), Lines), header = TRUE, fill = TRUE)

R data.table fread cannot read in irregular column lengths when the larger rows do not appear early in a file

Answers (1)

Related Questions