Reputation: 23
I have a data file in a weird format that I'm trying to import and wrangle in R. The file is like a CSV in that values are comma-delimited. However, there are also text strings containing commas that shouldn't be interpreted as delimiters.
The problem is, there is no use of quotes for individual elements. In the proprietary software, text strings containing non-delimiting commas are bounded by special, non-ASCII characters, which I'm guessing is how that software determines which commas not to use as delimiters.
I've figured out how to replace the non-ASCII characters with ASCII characters, to make it easier to work with in R. But, I still haven't figured out how to correctly split the text.
My current approach is to replace all the non-delimiting commas with another symbol (say, a semicolon) and then use strsplit() on the commas. But I haven't figured out how to replace the non-delimiting commas with semicolons in the first place.
For data privacy reasons, I'll create a dummy text string to work with here:
my_string <- "1,2,3,4,NONASCIIsome text, hereNONASCII,7,8,9,NONASCII,more, pesky, commas,NONASCII,10,11"
My desired output:
[1] "1" "2" "3" "4"
[5] "some text; here" "7" "8" "9"
[9] ";more; pesky; commas;" "10" "11"
If I run strsplit() as-is (after removing the "NONASCII"), I get something like this, which is not what I want:
[1] "1" "2" "3" "4" "some text" " here" "7" "8" "9"
[10] "" "more" " pesky" " commas" "" "10" "11"
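For completeness, that unwanted result can be reproduced directly — strip the markers, then split on every comma, which loses the grouping:

```r
my_string <- "1,2,3,4,NONASCIIsome text, hereNONASCII,7,8,9,NONASCII,more, pesky, commas,NONASCII,10,11"

# remove the markers, then split on every comma (including the non-delimiting ones)
stripped <- gsub("NONASCII", "", my_string)
unlist(strsplit(stripped, ","))
```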
I'm not sure if gsub() is the best function to use for this, but I've at least figured out how to match the problematic strings with a regex:
my_string2 <- gsub("NONASCII(.*?)NONASCII", "\\1", my_string)
That's about as far as I've gotten. The code above only removes the "NONASCII" markers; it doesn't replace the non-delimiting commas with semicolons.
If there's another approach that would work too, I'm all ears!
Upvotes: 2
Views: 103
Reputation: 269566
There is already a good answer -- in fact I was going to post a nearly identical one, but @lotus answered first; however, let me point out one other approach.
gsubfn() is like gsub() except that it passes the captured portion of the regular expression (the portion within parentheses) to the function specified in the second argument and replaces the entire match with that function's output. The function can be specified by giving its body as the right-hand side of a formula, and any free variables are assumed to be input arguments -- here x. Thus the inner gsub() is applied only between the markers, after which we can use strsplit().
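A tiny standalone illustration of the formula notation (toy pattern and input assumed here, not part of the solution): the captured group is bound to the free variable x, and the formula body computes the replacement for the whole match.

```r
library(gsubfn)

# upper-case each captured run of lower-case letters; x is the captured group
gsubfn("([a-z]+)", ~ toupper(x), "abc 123 def")
# "ABC 123 DEF"
```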
If my_string is a character vector rather than a single string, omit the unlist() at the end so that we get a list of character vectors of parsed values.
library(gsubfn)
my_string |>
  gsubfn("NONASCII(.*?)NONASCII", ~ gsub(",", ";", x), x = _) |>
  strsplit(",") |>
  unlist()
giving
[1] "1" "2" "3"
[4] "4" "some text; here" "7"
[7] "8" "9" ";more; pesky; commas;"
[10] "10" "11"
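As mentioned, with a character vector input, dropping the unlist() returns a list with one character vector per string. A small sketch with made-up input strings:

```r
library(gsubfn)

strings <- c("1,2,NONASCIIa, bNONASCII,3", "4,5")
strings |>
  gsubfn("NONASCII(.*?)NONASCII", ~ gsub(",", ";", x), x = _) |>
  strsplit(",")
# a list of two vectors: c("1", "2", "a; b", "3") and c("4", "5")
```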
Here is a base solution; however, it does involve three gsub() calls and a zero-width Perl lookahead regex. It gives the same result as shown above. Use characters other than { and } if those can appear in your data.
The key here is that the start and end markers be different, so first replace the starting NONASCII with { and the ending one with }. Then we can use the indicated regex to replace , within {...} only. Finally, remove the { and } we added. Now we can split on , as before.
my_string |>
  gsub("NONASCII(.*?)NONASCII", "{\\1}", x = _) |>
  gsub(",(?=[^{}]*})", ";", x = _, perl = TRUE) |>
  gsub("[{}]", "", x = _) |>
  strsplit(",") |>
  unlist()
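To see what the zero-width lookahead is doing, here is the middle step in isolation (toy input assumed): ,(?=[^{}]*}) matches a comma only when the next brace character after it is a closing }, i.e. only commas inside {...}.

```r
# only the comma between the braces is followed by "}" with no "{" in between
gsub(",(?=[^{}]*})", ";", "a,b,{c,d},e", perl = TRUE)
# "a,b,{c;d},e"
```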
Input as given in question:
my_string <- "1,2,3,4,NONASCIIsome text, hereNONASCII,7,8,9,NONASCII,more, pesky, commas,NONASCII,10,11"
Upvotes: 2
Reputation: 34441
You can replace 'NONASCII' with quotes, scan the data in, and replace the commas with semicolons:
my_string |>
  gsub("NONASCII", "'", x = _) |>
  scan(text = _, sep = ",", what = character(), quiet = TRUE) |>
  gsub(",", ";", x = _)
[1] "1" "2" "3"
[4] "4" "some text; here" "7"
[7] "8" "9" ";more; pesky; commas;"
[10] "10" "11"
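This works because scan()'s default quote argument for delimited input includes both single and double quotes, so the single quotes substituted for the markers group the embedded commas into one field. A minimal check on a toy string:

```r
# single quotes are treated as quoting characters by scan() when sep is given
scan(text = "1,'a, b',2", sep = ",", what = character(), quiet = TRUE)
# c("1", "a, b", "2")
```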
Upvotes: 4