Reputation: 7725
I have strings like:
string <- "1, 2, \"something, else\""
I want to use tidyr::separate_rows() with sep = ",", but the comma inside the quoted portion of the string is tripping me up. I'd like to remove the comma between something and else (but only this comma).
Here's a more complex toy example:
string <- c("1, 2, \"something, else\"", "3, 5, \"more, more, more\"", "6, \"commas, are fun\", \"no, they are not\"")
string
#[1] "1, 2, \"something, else\""
#[2] "3, 5, \"more, more, more\""
#[3] "6, \"commas, are fun\", \"no, they are not\""
I want to get rid of all commas inside the embedded quotations. Desired output:
[1] "1, 2, \"something else\""
[2] "3, 5, \"more more more\""
[3] "6, \"commas are fun\", \"no they are not\""
Upvotes: 7
Views: 641
Reputation: 19088
You can define a small function to do the replacement.
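library(stringr)
# Helper: strip every comma from a matched chunk
rmcom <- function(x) gsub(",", "", x)
# Match each quoted chunk that contains a comma and pass it to rmcom
str_replace_all(string, "(\"[[:alnum:]]+,[ [:alnum:],]*\")", rmcom)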
[1] "1, 2, \"something else\""
[2] "3, 5, \"more more more\""
[3] "6, \"commas are fun\", \"no they are not\""
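For comparison, the same idea (find each quoted chunk, then strip the commas inside it) can be sketched in base R with `regmatches<-`. This is only an illustration, not part of the answer above: the pattern `"[^"]*"` assumes quotes are balanced and never escaped inside a field, and the names `m` and `cleaned` are made up for the example.

```r
string <- c("1, 2, \"something, else\"",
            "3, 5, \"more, more, more\"",
            "6, \"commas, are fun\", \"no, they are not\"")

# Locate every double-quoted chunk (assumes balanced, unescaped quotes).
m <- gregexpr('"[^"]*"', string)

# Remove commas only inside the matched chunks, then write them back.
cleaned <- string
regmatches(cleaned, m) <- lapply(
  regmatches(cleaned, m),
  function(x) gsub(",", "", x, fixed = TRUE)
)
cleaned
```

Because only the matched chunks are rewritten, the commas that separate the fields are left untouched.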
Upvotes: 8
Reputation: 3081
Best I can do:
stringr::str_replace_all(string,"(?<=\\\".{1,15})(,)(?=.+?\\\")","")
Breaking the pattern down:
(?<= ) = look behind
\\\" = a \ and a "
.{1,15} = between 1 and 15 characters (see note)
(,) = the comma is what we want to target
(?= ) = look ahead
.+? = one or more characters but as few as possible
\\\" = a \ and a "
Note: a look behind cannot be unbounded, so we can't use .+? here. Adjust the max of 15 for your dataset.
edit: Andre Wildberg's solution is better - I stupidly forgot that the "" defining the string are not part of the string, so made it much more complex than it needed to be.
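A quick base R check of what the `\\\"` token in the breakdown above amounts to: the R string literal holds two characters, a backslash and a double quote, and the regex engine reads that pair as an escaped quote, so each match is a single `"`. The sketch uses `perl = TRUE` (PCRE) rather than stringr's ICU engine, an assumption made only to keep it dependency-free; `pat` and `hits` are illustrative names.

```r
pat <- "\\\""   # R string literal for the two characters \ and "
nchar(pat)      # 2: the backslash and the quote

x <- "1, 2, \"something, else\""

# Each match is the quote character alone; no backslash exists in x.
hits <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
hits
nchar(hits)
```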
Upvotes: 3
Reputation: 7287
Alternatively, we could invert the problem (and keep the comma, which might be useful) and use a regex directly with separate_rows to split only at the commas NOT inside quotes:
library(tidyr)
df |>
  separate_rows(stringcol, sep = '(?!\\B"[^\"]*), (?![^"]*\"\\B)')
Regex expression from: Regex find comma not inside quotes
Alternatively: Regex to pick characters outside of pair of quotes
Output:
# A tibble: 9 × 1
stringcol
<chr>
1 "1"
2 "2"
3 "\"something, else\""
4 "3"
5 "5"
6 "\"more, more, more\""
7 "6"
8 "\"commas, are fun\""
9 "\"no, they are not\""
Data:
library(tibble)
df <- tibble(stringcol = string)
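To try the split pattern on the plain character vector before wiring it into a data frame, a minimal base R sketch could look like the following. It swaps in `strsplit()` with `perl = TRUE` for `separate_rows()`, which is an assumption of this example, and the names `pattern` and `parts` are made up. Note that the pattern only splits on a comma followed by a space and relies on each quote sitting next to a word character.

```r
string <- c("1, 2, \"something, else\"",
            "3, 5, \"more, more, more\"",
            "6, \"commas, are fun\", \"no, they are not\"")

# Same regex as in the separate_rows() call: split at ", " unless the text
# that follows runs up to a closing quote.
pattern <- '(?!\\B"[^\"]*), (?![^"]*\"\\B)'

# strsplit() returns one vector per input string; flatten them into one.
parts <- unlist(strsplit(string, pattern, perl = TRUE))
parts
```

The quoted fields come back whole, commas included, which mirrors the nine rows in the tibble output.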
Upvotes: 3