Reputation: 1493
I have a tool (exe provided to me), which outputs poorly formatted csv's. They are bad in that the last value can have commas, with no quotes, e.g.:
184500,OBJECT_CALENDAR,,,UNITS_NO_UNITS,NULL,,,,Sched N&S B1,1st,3rd,4S,5th&6th
Where the last string actually begins at 'Sched', so I would expect to see something like this:
184500,OBJECT_CALENDAR,,,UNITS_NO_UNITS,NULL,,,,"Sched N&S B1,1st,3rd,4S,5th&6th"
This is screwing up everything I am trying to do, and I am curious how to address it. Is there a way to define the number of columns in read.csv?
I have tried to read it line by line, but it is slow, and less than elegant:
processFile = function(filepath) {
i = 1
vector = character(0)
theFile = file(filepath, "r")
while ( TRUE ) {
line = readLines(theFile, n = 1)
if ( length(line) == 0 ) {
break
} else {
vector[i] <- line
i = i+1
}
}
close(theFile)
formatted <- lapply(strsplit(vector[-1],','), function(x) {c(x[1:9], paste(x[10:length(x)], collapse = ','))})
finalFrame <- as.data.frame(matrix(unlist(formatted),ncol = 10, byrow = TRUE))
return(finalFrame)
}
Any better ways to do this? Any base functions that can do this, and if not, any libraries?
Upvotes: 4
Views: 10084
Reputation: 9666
Specifying the classes for each column seems to work in my case. So if you have 4 columns and the 4th one might have varying number of commas, try this:
theData <- read.table(filepath, colClasses=rep("character" ,4))
Of course adjust the number of columns and their classes to your situation. Here is what I get on toy csv file:
> read.table("tmp.csv", colClasses=rep("character" ,4), header=FALSE)
V1 V2 V3 V4
1 A, B, C, 1&2
2 A, C, C, 1,2,3
3 A, V, X, 12
4 A, V, D, 1,0
Another option would be to use read.csv
with fill=TRUE
argument
theData <- read.csv(filepath, fill=TRUE)
This will produce a data.frame
with number of columns equal to the line with the maximum number of commas. Then you would have to manually combine those split commas into one.
NOTE: this will work in the case when only the last column can have loose commas.
Upvotes: 3
Reputation: 12165
This isn't ideal since you still have to read the file in line by line, but
stringr::str_split
has a parameter n
that specifies the maximum number of splits. If you set pattern = ","
and n=10
, then it will split your string into only 10 pieces, leaving the last chunk as a single string.
Upvotes: 3