How to use fread() as readLines() without auto column detection?

Question

I have a 5 GB .dat file (more than 10 million lines). Each line looks like aaaa bb cccc0123 xxx kkkkkkkkkkkkkk or aaaaabbbcccc01234xxxkkkkkkkkkkkkkk, for example. Because readLines() performs poorly on large files, I tried fread() instead, but got an error:

library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") : 
  Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
  Unable to find 5 lines with expected number of columns (+ middle)

How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?

Rich Scriven · Accepted Answer

Here's a trick. You could use a sep value that you know does not occur in the file. Because the separator is never found, fread() reads each whole line as a single column. We can then extract that column as a character vector (shown as [[1L]] below). Here's an example on a csv where I use ? as the sep. This way it behaves much like readLines(), only a lot faster.

f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"       
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"  
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,," 
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"

Other uncommon characters you can try as sep include \ ^ @ # = and others. It's just a matter of finding a single-character sep value that is not present in the file. We can see that this produces the same output as readLines():

head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"                                  
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"                             
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"                            
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"                           
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
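
As a rough sketch of how one might pick such a sep, the snippet below checks a sample of lines for a few candidate characters and then compares the fread() result with readLines(). The temporary file, its two lines (taken from the question's example format), and the names candidates, sample_lines and sep_char are all made up for illustration. Note that only a sample is checked, so a character could still turn up further down in a large file.

```r
library(data.table)

# Hypothetical sample file written to a temporary path (illustration only).
path <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk"), path)

# Candidate separators; each must be a single character.
candidates <- c("?", "^", "@", "#", "=")

# Look at a sample of lines only; fixed = TRUE treats each candidate
# as a literal character rather than a regular expression.
sample_lines <- readLines(path, n = 1000L)
present <- vapply(candidates,
                  function(ch) any(grepl(ch, sample_lines, fixed = TRUE)),
                  logical(1))

# First candidate that was not seen in the sample.
sep_char <- candidates[!present][1L]

# Read every line as one column, then pull it out as a character vector.
lines_fread <- fread(path, sep = sep_char, header = FALSE)[[1L]]

# Compare against base R on the same file.
identical(lines_fread, readLines(path))
```

On the real file you would replace path with the file name and probably sample more lines before trusting the chosen character.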

Note: As @Cath has mentioned in the comments, you could also simply use the line break character as the sep value.
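
A minimal sketch of that variant, assuming a reasonably recent data.table in which sep = "\n" is accepted and means "one column per line". The temporary file is a made-up stand-in for the real one. The quote = "" and strip.white = FALSE arguments are my own additions, not part of the comment: they are meant to stop fread() from interpreting quote characters and from trimming leading or trailing spaces, which readLines() would keep. Lines that are empty or whitespace-only may still be handled differently from readLines(), so it is worth comparing the two on a small sample of the real data.

```r
library(data.table)

# Hypothetical sample file written to a temporary path (illustration only).
path <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk"), path)

# sep = "\n": no column splitting, each line becomes one character field.
# quote = "": do not treat quote characters specially.
# strip.white = FALSE: keep leading/trailing spaces within each line.
x <- fread(path, sep = "\n", header = FALSE,
           quote = "", strip.white = FALSE)[[1L]]

head(x)
```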
