Reputation: 853
This question may have been asked before, but I couldn't find it. I have a list of CSV files (439 or so) where, in a few of the files, someone also used commas in editorial comments. The result is that I can't put the files into a data frame, since the files now do not have the same number of elements after splitting them. Anyways, the problem I'm facing looks like this:
vec1 <- paste("484,1213,0,62.0006,1,go -- late F1 max, but glide?")
vec2 <- paste("467,1387,0,62.0026,1,goes2")
ls <- list(vec1, vec2)
What I want to do is to have a data frame with six columns. If there wasn't a comma in the editorial comments for vec1, I could use (and have been using, until I found this problematic example) the following:
df <- ldply(ls, function(x)unlist(strsplit(x[1], split = ",")))
However, I'm getting the obvious error message that the results do not all have the same length. Is there any way of getting rid of that comma, or turning it into a semicolon, or ensuring that, if there are 7 elements in a vector, elements 6 and 7 are combined?
If it helps, this is how I'm reading the files in R (I'm using scan because there is other information in the files that I want. There are some odd encoding issues going on here as well, but this seems to work).
data <- scan(file, fileEncoding="latin1", blank.lines.skip = FALSE, what = "list", sep = "\n", quiet = TRUE)
Upvotes: 1
Views: 93
Reputation: 626728
If you need the comments, you can still replace the 6th comma with a semicolon and use your previous solution:
gsub("((?:[^,]*,){5}[^,]*),", "\\1;", vec1, perl=TRUE)
Regex explanation:
((?:[^,]*,){5}[^,]*) - a capturing group that we will refer to as Group 1 with \\1 in the replacement pattern, matching:
  (?:[^,]*,){5} - 5 sequences of non-comma characters, each followed by a comma
  [^,]* - 0 or more non-commas
, - the comma we'll turn into a ; in the replacement
Or (as @CathG pointed out), the \\K operator can also be used with Perl-like expressions:
sub("^([^,]+,){5}[^,]+\\K,", ";", vec1, perl=TRUE)
From PCRE documentation:
The escape sequence \K causes any previously matched characters not to be included in the final matched sequence.
However, it will not "normalize" any other commas that might follow.
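If a comment can contain more than one stray comma, one possible workaround is to skip the regex, split on every comma, and glue everything from the sixth piece onward back together. The following is only a minimal sketch under some assumptions: split_keep_comment is a hypothetical helper name, the third sample line is invented to show several extra commas, and every line is assumed to have at least six fields (strsplit drops a trailing empty field, so lines ending in a comma would need extra handling).

```r
# First two lines come from the question; the third is a made-up
# example with two stray commas in the comment.
lines <- c(
  "484,1213,0,62.0006,1,go -- late F1 max, but glide?",
  "467,1387,0,62.0026,1,goes2",
  "470,1400,0,62.0100,1,a, b, c"
)

# Hypothetical helper: split on every comma, then merge any pieces
# beyond the expected number of fields into the last field,
# joining them with semicolons.
split_keep_comment <- function(x, n_fields = 6) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  if (length(parts) > n_fields) {
    comment <- paste(parts[n_fields:length(parts)], collapse = ";")
    parts <- c(parts[seq_len(n_fields - 1)], comment)
  }
  parts
}

# Every element now has six pieces, so the rows can be bound together.
fields <- lapply(lines, split_keep_comment)
df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
```

All columns come out as character here; the numeric ones can be converted afterwards, for example with as.numeric or type.convert.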
Upvotes: 2