Macosso

Reputation: 1439

Reading a file with a variable number of columns using fread() in R

I am trying to read a file that should have 7 columns by default, but some strings probably contain commas, which gives some rows more than 7 columns. Whatever the extra columns contain, my only goal is to read the first 7. However, fread does not read the whole file, even with the argument select = 1:7:

> data <- fread("dpp.DAT", header=FALSE, fill=T, select = 1:7, sep=",", stringsAsFactors = F)
Warning message:
In fread("dpp.DAT", header = FALSE, fill=T, select = 1:7,sep = ",", stringsAsFactors = F) :
  Stopped early on line 45922. Expected 7 fields but found 8. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<84172666,DS,BRAND 4 - DERIVATIVE,#PL LOC BDD  :  BDD - BRAND 3 - DERIVATIVE,37324,BLEND-A-MD-INSPRD-BY-NTR-SGHH,BLEND B MAR INSPIRED BY OTHER CHAMOMILE, VAG + HHHH>>

Is there a trick you can suggest to read all the rows of the file?

Sample dataset

Upvotes: 2

Views: 708

Answers (3)

rferrisx

Reputation: 1728

Dean's answer provides more automation than mine. Whenever I hit this problem (which usually points to poorly formatted data), I resort to finding the break point manually, reading the file in pieces, and rebuilding the extract with rbind:

s1 <- fread("Extract.txt",
    nrows=674170,
    strip.white = TRUE,
    fill = TRUE,
    blank.lines.skip = TRUE,
    encoding="UTF-8")

s2 <- fread("Extract.txt",
    strip.white = TRUE,
    fill = TRUE,
    blank.lines.skip = TRUE,
    skip=674170,
    encoding="UTF-8")
# Repeat with further nrows/skip chunks until all of "Extract.txt" has been read, then stack the pieces
s3 <- rbind(s1,s2)
rm(s1)
rm(s2)
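
Finding the break point by hand can be sped up by counting fields per line in the shell. This is only a sketch: it assumes a comma-separated file and an expected width of 7 fields (both borrowed from the question, not from Extract.txt), and sample.csv and bad_lines.txt are made-up names for illustration.

```shell
# Build a small illustrative file: line 3 has one field too many.
printf '1,a,b,c,d,e,f\n2,a,b,c,d,e,f\n3,a,b,c,d,e,f,g\n4,a,b,c,d,e,f\n' > sample.csv

# Report the line number and field count of every line that does not have 7 fields.
awk -F',' 'NF != 7 { print NR ": " NF " fields" }' sample.csv > bad_lines.txt
cat bad_lines.txt
```

The reported line numbers give candidate values for nrows and skip in the calls above. awk splits on every comma, so a comma inside a quoted string is counted as a separator too.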

Upvotes: 1

Dean MacGregor

Reputation: 18396

data.table gets finicky when the extra columns first show up in the middle of the file rather than at the beginning, which is why select and fill don't work here. What you can do is take all the rows it gives you up front and then try again, using skip to pass over the rows you've already loaded. On that second (or later) attempt the extra columns are at the beginning, so fill and select work as expected. There are probably more elegant ways to do the following, but this works:

library(data.table)

filename <- "dpp.DAT"  # path to the input file (name taken from the question)

#capture warnings so we can evaluate what happened last in code
tempfile='tmp321364.txt' 
conn<-file(tempfile, open="r+")
sink(file=conn, type='message')

DT<-list()
while(TRUE) {
  DT[[length(DT)+1]] <- fread(filename, header=FALSE,stringsAsFactors = F, fill=T, select=1:7, skip=ifelse(length(DT)>0,sum(sapply(DT, nrow)),0))
  if(nrow(DT[[length(DT)]])==0) break
  warns<-readLines(conn)
  if(length(warns)==3) { #The warning about extra columns is 3 lines long
    DT[[length(DT)+1]]<-  fread(filename, header=FALSE,stringsAsFactors = F, fill=T, select=1:7, skip=sum(sapply(DT, nrow)))
    if(nrow(DT[[length(DT)]])==0) break
  } else { #an error about skipping too many rows is not 3 lines, assuming away other issues
    break
  }
}
DT<-rbindlist(DT)
sink(NULL, type='message')
close(conn)
rm(tempfile)

With your exact data you don't need the while(TRUE) loop, but if, for example, a 10th column showed up even further down, the loop would handle that case as well.

Upvotes: 3

user438383

Reputation: 6206

Say we have a text file "test.txt" like this:

a,b,c
d,e,f
g,h,i,j
k,l,m

We can read it in with fill=T and then drop the extra fourth column:

> fread("test.txt", fill=T)[,-4]
   V1 V2 V3
1:  a  b  c
2:  d  e  f
3:  g  h  i
4:  k  l  m

Or, set select=1:3:

> fread("test.txt", fill=T, select = 1:3)
   V1 V2 V3
1:  a  b  c
2:  d  e  f
3:  g  h  i
4:  k  l  m

EDIT

The solution was to use the Unix cut command to keep only the first 7 fields before reading the file:

terminal$ cut -d',' -f1-7 Test_Fread_column.DAT > tmp
R> fread("tmp")
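
To see why this works, here is a minimal sketch on the four-line sample from earlier in this answer rather than the real file. cut splits every line on commas and keeps fields 1 to 3 (1 to 7 for the real data), so every row comes out with the same width. The file name tmp_demo.csv is made up for illustration.

```shell
# Recreate the small sample file shown earlier in this answer.
printf 'a,b,c\nd,e,f\ng,h,i,j\nk,l,m\n' > test.txt

# Keep only the first three comma-separated fields of each line.
cut -d',' -f1-3 test.txt > tmp_demo.csv

cat tmp_demo.csv
```

Note that cut is not quote-aware and simply discards everything after the last kept field. For the line quoted in the warning, that appears to mean the text after the stray comma in the final string is lost, which is fine when only the first 7 columns matter.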

Upvotes: 4
