Reputation: 1439
I am trying to read a file that is supposed to have 7 columns, but some strings probably contain commas, which causes some rows to have more than 7 columns.
Regardless of what is in the extra columns, my only goal is to read the first 7 columns. However, fread does not read the whole file, even after adding the argument select = 1:7:
> data <- fread("dpp.DAT",header=FALSE, fill=T, select = 1:7, sep=",",stringsAsFactors = F)
Warning message:
In fread("dpp.DAT", header = FALSE, fill=T, select = 1:7,sep = ",", stringsAsFactors = F) :
Stopped early on line 45922. Expected 7 fields but found 8. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<84172666,DS,BRAND 4 - DERIVATIVE,#PL LOC BDD : BDD - BRAND 3 - DERIVATIVE,37324,BLEND-A-MD-INSPRD-BY-NTR-SGHH,BLEND B MAR INSPIRED BY OTHER CHAMOMILE, VAG + HHHH>>
Is there a trick you can suggest to read all the rows of the file?
Upvotes: 2
Views: 708
Reputation: 1728
Dean's answer provides more automation than mine. Whenever I hit this problem (which probably comes down to poorly formatted data), I resort to manually finding the row where the read stops, reading the file in pieces, and rebuilding the extract with rbind:
s1 <- fread("Extract.txt",
            nrows=674170,
            strip.white = TRUE,
            fill = TRUE,
            blank.lines.skip = TRUE,
            encoding="UTF-8")
s2 <- fread("Extract.txt",
            strip.white = TRUE,
            fill = TRUE,
            blank.lines.skip = TRUE,
            skip=674170,
            encoding="UTF-8")
# repeat ad infinitum until you have read all of "Extract.txt"
s3 <- rbind(s1,s2)
rm(s1)
rm(s2)
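To find the row numbers for nrows and skip without scrolling through the file, something like the following may help. It is only a sketch: it uses base R's count.fields on a small made-up demo file (the path and the expected column count of 3 are assumptions for illustration; for the file in the question the expected count would be 7).

```r
# Hypothetical demo file with one ragged line (line 3 has an extra field)
path <- tempfile(fileext = ".txt")
writeLines(c("a,b,c", "d,e,f", "g,h,i,j", "k,l,m"), path)

# Count the comma-separated fields on every line.
# quote = "" and comment.char = "" stop quotes and '#' from being treated
# specially; blank.lines.skip = FALSE keeps positions aligned with line numbers.
n_fields <- count.fields(path, sep = ",", quote = "",
                         blank.lines.skip = FALSE, comment.char = "")

expected <- 3L                            # assumed column count for the demo
bad_lines <- which(n_fields != expected)  # lines with a different field count
bad_lines
```

The values in bad_lines are candidate break points for the nrows and skip arguments used above.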
Upvotes: 1
Reputation: 18396
data.table gets finicky about extra columns that show up in the middle of the file rather than at the beginning, which is why select and fill don't work here. What you can do is take all the rows it gives you up front and then try again with skip set to the number of rows you've already loaded. On that second (or later) attempt the extra columns are at the beginning, so fill and select work as expected. There are probably more elegant ways to do the following, but this works:
library(data.table)
filename <- "dpp.DAT"  # the file from the question
# capture warnings so we can evaluate what happened in the last read
tempfile <- 'tmp321364.txt'
conn <- file(tempfile, open="w+")  # "w+" creates the file if it does not exist yet
sink(file=conn, type='message')
DT <- list()
while(TRUE) {
  # read as many rows as fread will give us, skipping the rows already loaded
  DT[[length(DT)+1]] <- fread(filename, header=FALSE, stringsAsFactors=FALSE, fill=TRUE, select=1:7,
                              skip=ifelse(length(DT)>0, sum(sapply(DT, nrow)), 0))
  if(nrow(DT[[length(DT)]])==0) break
  warns <- readLines(conn)
  if(length(warns)==3) { # the warning about extra columns is 3 lines long
    DT[[length(DT)+1]] <- fread(filename, header=FALSE, stringsAsFactors=FALSE, fill=TRUE, select=1:7,
                                skip=sum(sapply(DT, nrow)))
    if(nrow(DT[[length(DT)]])==0) break
  } else { # an error about skipping too many rows is not 3 lines, assuming away other issues
    break
  }
}
DT <- rbindlist(DT)
sink(NULL, type='message')
close(conn)
rm(tempfile)
With your exact data you don't need the while(TRUE) loop, but if, for example, a 10th column showed up even further down, the loop would handle that case.
Upvotes: 3
Reputation: 6206
Say we have a text file "test.txt" like this:
a,b,c
d,e,f
g,h,i,j
k,l,m
We can read it in with fill=T and then drop the final column:
> fread("test.txt", fill=T)[,-4]
   V1 V2 V3
1:  a  b  c
2:  d  e  f
3:  g  h  i
4:  k  l  m
Or, set select=1:3:
> fread("test.txt", fill=T, select = 1:3)
   V1 V2 V3
1:  a  b  c
2:  d  e  f
3:  g  h  i
4:  k  l  m
EDIT
The solution was to use the Unix cut command, as follows:
terminal$ cut -d',' -f1-7 Test_Fread_column.DAT > tmp
R> fread("tmp")
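If cut is not available (on Windows, for instance), roughly the same trimming can be done inside R. The sketch below is an alternative to the cut approach and makes no claim to be faster: it reads the raw lines, keeps the first n comma-separated fields of each, and hands the result to fread through its text argument. The helper name read_first_fields and the demo file are made up for illustration. Like cut, it splits on every comma and ignores quoting, so it assumes the extra commas only occur after the fields you want to keep.

```r
library(data.table)

# Hypothetical helper: keep only the first n comma-separated fields of each line
read_first_fields <- function(path, n) {
  lines <- readLines(path, warn = FALSE)
  parts <- strsplit(lines, ",", fixed = TRUE)
  # re-join at most n fields per line, dropping anything after them
  trimmed <- vapply(parts,
                    function(x) paste(head(x, n), collapse = ","),
                    character(1))
  fread(text = trimmed, sep = ",", header = FALSE, fill = TRUE)
}

# Demo on a small ragged file (assumed layout, mirrors test.txt above)
path <- tempfile(fileext = ".txt")
writeLines(c("a,b,c", "d,e,f", "g,h,i,j", "k,l,m"), path)
DT <- read_first_fields(path, 3)
```

For the file in the question the call would be read_first_fields("dpp.DAT", 7). Because the whole file is read into memory as text first, this may be slow on very large files.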
Upvotes: 4