JoeF

Reputation: 853

Replace improper commas in CSV file

This question may have been asked before, but I couldn't find it. I have a list of CSV files (439 or so) where, in a few of the files, someone also used commas in editorial comments. The result is that I can't put the files into a data frame, since the files now do not have the same number of elements after splitting them. Anyways, the problem I'm facing looks like this:

vec1 <- paste("484,1213,0,62.0006,1,go -- late F1 max, but glide?")
vec2 <- paste("467,1387,0,62.0026,1,goes2")

ls <- list(vec1, vec2)

What I want to do is to have a data frame with six columns. If there wasn't a comma in the editorial comments for vec1, I could use (and have been using, until I found this problematic example) the following:

library(plyr)  # provides ldply()
df <- ldply(ls, function(x) unlist(strsplit(x[1], split = ",")))

However, I'm getting the obvious error message that the results do not all have the same length. Is there any way of getting rid of that comma, or turning it into a semicolon, or ensuring that, if there are 7 elements in a vector, elements 6 and 7 are combined?

If it helps, this is how I'm reading the files in R. (I'm using scan because there is other information in the files that I want. There are some odd encoding issues going on here as well, but this seems to work.)

data <- scan(file, fileEncoding="latin1", blank.lines.skip = FALSE, what = "list", sep = "\n", quiet = TRUE)   

Upvotes: 1

Views: 93

Answers (1)

Wiktor Stribiżew

Reputation: 626728

If you need the comments, you can still replace the 6th comma with a semicolon and use your previous solution:

gsub("((?:[^,]*,){5}[^,]*),", "\\1;", vec1, perl=TRUE)

Regex explanation:

  • ((?:[^,]*,){5}[^,]*) - a capturing group (Group 1), which we refer to with \\1 in the replacement pattern, matching
    • (?:[^,]*,){5} - 5 sequences of non-comma characters followed by a comma
    • [^,]* - 0 or more non-commas
  • , - the comma we'll turn into a ; in the replacement
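As a rough end-to-end sketch of how this could be used on the question's data (the names raw_lines, cleaned and fields are only illustrative, and base R's do.call(rbind, ...) is swapped in for ldply to avoid the plyr dependency):

```r
vec1 <- "484,1213,0,62.0006,1,go -- late F1 max, but glide?"
vec2 <- "467,1387,0,62.0026,1,goes2"
raw_lines <- c(vec1, vec2)

# Turn the 6th comma, if there is one, into a semicolon.
# Lines with only 5 commas do not match and are returned unchanged.
cleaned <- gsub("((?:[^,]*,){5}[^,]*),", "\\1;", raw_lines, perl = TRUE)

# Each line now splits into exactly six fields.
fields <- strsplit(cleaned, ",", fixed = TRUE)

# Stack the character vectors into a six-column data frame (V1 ... V6).
df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
```

All columns come back as character, so numeric columns would still need converting, for example with as.numeric().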

Alternatively, as @CathG pointed out, the \K operator (written \\K in an R string) can be used with Perl-compatible expressions:

sub("^([^,]+,){5}[^,]+\\K,", ";", vec1, perl = TRUE)

From PCRE documentation:

The escape sequence \K causes any previously matched characters not to be included in the final matched sequence.
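One way to see the effect of \K is to inspect the match with base R's regexpr() and regmatches(). This is only a small check on the question's vec1, with the pattern stored in an illustrative variable called pattern:

```r
vec1 <- "484,1213,0,62.0006,1,go -- late F1 max, but glide?"
pattern <- "^([^,]+,){5}[^,]+\\K,"

# Locate the match; everything before \K is consumed but not reported.
m <- regexpr(pattern, vec1, perl = TRUE)

# The reported match is just the comma, not the six fields before it.
matched <- regmatches(vec1, m)

# Because only the comma counts as the match, sub() replaces only the comma.
result <- sub(pattern, ";", vec1, perl = TRUE)
```

Because the preceding fields are excluded from the match, no capture group or backreference is needed in the replacement.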

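However, neither expression will "normalize" any further commas in the comment: only the 6th comma is replaced, so a comment containing two or more commas still yields extra fields.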
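If comments may contain several commas, one possible workaround is to split first and then glue everything after the fifth field back together. This is a sketch, not part of the original answer: merge_extra_fields, n_fields and glue are hypothetical names, and it assumes the first five fields never contain commas.

```r
# Hypothetical helper: keep the first n_fields - 1 fields as they are and
# join everything after them into one final field, separated by `glue`.
merge_extra_fields <- function(x, n_fields = 6, glue = ";") {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  # Nothing to merge; note that shorter lines are returned as-is.
  if (length(parts) <= n_fields) return(parts)
  head_part <- parts[seq_len(n_fields - 1)]
  tail_part <- paste(parts[n_fields:length(parts)], collapse = glue)
  c(head_part, tail_part)
}

# Made-up variant of the question's data with two commas in the comment.
rows <- c("484,1213,0,62.0006,1,go -- late, F1 max, but glide?",
          "467,1387,0,62.0026,1,goes2")

df <- as.data.frame(do.call(rbind, lapply(rows, merge_extra_fields)),
                    stringsAsFactors = FALSE)
```

Note that strsplit() drops a trailing empty field, so a line that ends in a comma (an empty comment) would come back with five fields and need separate handling.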

Upvotes: 2
