Reputation: 67
I'm trying to read thousands of csv files into R with read.csv, but I am running into a lot of trouble when my text has commas.
My csv file has 16 columns with headers. Some of the text in column 1 has commas. Column 2 is a string, and Column 3 is always a number.
For instance, an entry in column 1 is: "I do not know Robert, Kim, or Douglas"- Marcus. A. Ten, Inc President
When I try to
df <- do.call("rbind", lapply(paste(CSVpath, fileNames, sep=""), read.csv, header=TRUE, stringsAsFactors=TRUE, row.names=NULL))
I get a df with more than 16 columns and the above text is split into 4 columns:
V1                      V2    V3                              V4
"I do not know Robert   Kim   or Douglas" - Marcus. A. Ten    Inc President
when I need it all in one column as:
V1
"I do not know Robert, Kim, or Douglas"- Marcus. A. Ten, Inc President
Upvotes: 0
Views: 1563
Reputation: 161110
First, if you have control over the data output format, I strongly urge you to either (a) correctly quote the fields, or (b) use another character as a delimiter (e.g., tab, pipe "|"). This is the ideal solution, as it will certainly speed up future processing and "fix the glitch", so to speak.
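For example, if these files are being written from R in the first place, either fix is a one-liner (dat and the paths here are just placeholders):
# (a) write.csv quotes character fields by default, so embedded commas are safe
write.csv(dat, "path/to/output.csv", row.names = FALSE)
# (b) or use a pipe as the field delimiter instead of a comma
write.table(dat, "path/to/output.psv", sep = "|", row.names = FALSE)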
Lacking that, you can try to programmatically fix all rows. Assuming that only the first column is problematic (i.e., all of the other columns are well formed), then on a line-by-line basis you can change the true separators to a different delimiter (e.g., pipe or tab).
For this example, I have 4 columns delimited with a comma, and I'm going to change the legitimate separators to a pipe.
Some data and magic constants:
txt <- '"I do not know Robert, Kim, or Douglas" - Marcus. A. Ten, Inc President,TRUE,0,14
"Something, else",FALSE,1,15
"Something correct",TRUE,2,22
Something else,FALSE,3,33'
nColumns <- 4 # known a priori
oldsep <- ","
newsep <- "|"
In your case, you'll read in the data:
txt <- readLines("path/to/malformed.csv")
nColumns <- 16
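(And if you want to repair all of your files in one pass, a minimal sketch, reusing the CSVpath and fileNames from your question, is to read each file's raw lines up front and then run the steps below on each element of txts:)
# raw lines of every file; the splitting/recombining below is applied per file
txts <- lapply(paste0(CSVpath, fileNames), readLines)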
Do a manual (text-based, not parsing for data types) separation:
splits <- strsplit(readLines(textConnection(txt)), oldsep)
Realize that this reads, for example, the FALSE fields as the verbatim characters "FALSE", not as a logical (boolean) data type. This might be avoided if we took on the magic type-detection done by read.csv and cousins, but why?
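For instance, the first demo line is split into seven character fields instead of four (output shown approximately):
splits[[1]]
# [1] "\"I do not know Robert"         " Kim"
# [3] " or Douglas\" - Marcus. A. Ten" " Inc President"
# [5] "TRUE"                           "0"
# [7] "14"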
Per line: set aside the last nColumns-1 fields, take all of the fields before them and recombine them with the old separator, producing a single field (with its commas restored); then combine that field with the remaining nColumns-1 fields, joining everything with the new separator. (BTW: making sure we deal with quoting of double-quotes correctly, too.)
txt2 <- sapply(splits, function(vec) {
  n <- length(vec)
  # fewer fields than expected (odd/short line): just re-join with the new separator
  if (n < nColumns) return(paste(vec, collapse = newsep))
  # glue the over-split leading fields back into one field using the old separator
  vec1 <- paste(vec[1:(n - nColumns + 1)], collapse = oldsep)
  # quote that combined field, doubling any embedded double-quotes
  vec1 <- sprintf('"%s"', gsub('"', '""', vec1))
  # append the remaining nColumns-1 well-formed fields, joined by the new separator
  paste(c(vec1, vec[(n - nColumns + 2):n]), collapse = newsep)
})
txt2[1]
# [1] "\"\"\"I do not know Robert, Kim, or Douglas\"\" - Marcus. A. Ten, Inc President\"|TRUE|0|14"
(The sprintf line may not be necessary if the original file has correct quoting of double-quotes ... but then again, if it had correct quoting, we wouldn't be having this problem in the first place.)
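As a standalone illustration of that quote-doubling step (the string here is just a throwaway example):
sprintf('"%s"', gsub('"', '""', 'say "hi", please'))
# [1] "\"say \"\"hi\"\", please\""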
Now, either absorb the data directly into a data.frame:
read.csv(textConnection(txt2), header = FALSE, sep = newsep)
# V1 V2 V3 V4
# 1 "I do not know Robert, Kim, or Douglas" - Marcus. A. Ten, Inc President TRUE 0 14
# 2 "Something, else" FALSE 1 15
# 3 "Something correct" TRUE 2 22
# 4 Something else FALSE 3 33
or write these back to a file (good if you want to deal with these files elsewhere), adding con = "path/to/filename" as appropriate:
writeLines(txt2)
# """I do not know Robert, Kim, or Douglas"" - Marcus. A. Ten, Inc President"|TRUE|0|14
# """Something, else"""|FALSE|1|15
# """Something correct"""|TRUE|2|22
# "Something else"|FALSE|3|33
(Two notable changes: the correct comma-delimiters are now pipes, and all other commas are still commas; and there is correct quoting around the double-quotes. Yes, an escaped double-quote is just two double-quotes. That's what R expects if there are quotes within a field.)
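For instance, to send the repaired lines to a file instead of the console (the file name is only a placeholder):
writeLines(txt2, con = "path/to/fixed.csv")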
NB: though this seems to work with my fabricated data (and I hope it works with yours), you do not hear of people touting R's speed and efficiency in doing text mangling in this fashion. There are certainly better ways to do this, perhaps using python, awk, or sed. There are possibly faster ways to do this in R.
Upvotes: 1