Reputation: 23
I have a data file in a weird format that I'm trying to import and wrangle in R. The file is like a CSV in that values are comma-delimited. However, there are also text strings containing commas that shouldn't be interpreted as delimiters.
The problem is, there is no use of quotes for individual elements. In the proprietary software, text strings containing non-delimiting commas are bounded by special, non-ASCII characters, which I'm guessing is how that software determines which commas not to use as delimiters.
I've figured out how to replace the non-ASCII characters with ASCII characters, to make it easier to work with in R. But, I still haven't figured out how to correctly split the text.
My current approach is to replace all the non-delimiting commas with another symbol (say, a semicolon) and then use strsplit() on the commas. But I haven't figured out how to replace the non-delimiting commas with semicolons in the first place.
For data privacy reasons, I'll create a dummy text string to work with here:
my_string <- "1,2,3,4,NONASCIIsome text, hereNONASCII,7,8,9,NONASCII,more, pesky, commas,NONASCII,10,11"
My desired output:
[1] "1" "2" "3" "4"
[5] "some text; here" "7" "8" "9"
[9] ";more; pesky; commas;" "10" "11"
If I run strsplit() as-is (after removing the "NONASCII"), I get something like this, which is not what I want:
[1] "1" "2" "3" "4" "some text" " here" "7" "8" "9"
[10] "" "more" " pesky" " commas" "" "10" "11"
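For completeness, that unwanted result can be reproduced directly — strip the markers, then split on every comma, which loses the grouping:

```r
my_string <- "1,2,3,4,NONASCIIsome text, hereNONASCII,7,8,9,NONASCII,more, pesky, commas,NONASCII,10,11"

# remove the markers, then split on every comma (including the non-delimiting ones)
stripped <- gsub("NONASCII", "", my_string)
unlist(strsplit(stripped, ","))
```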
I'm not sure if gsub() is the best function to use for this, but I've at least figured out how to match the problematic strings with a regex:
my_string2 <- gsub("NONASCII(.*?)NONASCII", "\\1", my_string)
That's about as far as I've gotten. The code above only removes the "NONASCII" markers; it doesn't replace the non-delimiting commas with semicolons.
If there's another approach that would work too, I'm all ears!
Upvotes: 2
Views: 103
Reputation: 269566
There is already a good answer -- in fact I was going to post a nearly identical one, but @lotus answered first; however, let me point out one other approach.
gsubfn() is like gsub() except that it passes the captured portion of the regular expression (the portion within parentheses) to the function specified in the second argument and replaces the entire match with that function's output. The function can be specified by giving its body as the right-hand side of a formula, and any free variables are assumed to be input arguments -- here x. Thus the inner gsub() is applied only between the markers, after which we can use strsplit().
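A tiny standalone illustration of the formula notation (toy pattern and input assumed here, not part of the solution): the captured group is bound to the free variable x, and the formula body computes the replacement for the whole match.

```r
library(gsubfn)

# upper-case each captured run of lower-case letters; x is the captured group
gsubfn("([a-z]+)", ~ toupper(x), "abc 123 def")
# "ABC 123 DEF"
```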
If my_string is a character vector rather than a single string, omit the unlist() at the end so that we get a list of character vectors of parsed values.
library(gsubfn)
my_string |>
  gsubfn("NONASCII(.*?)NONASCII", ~ gsub(",", ";", x), x = _) |>
  strsplit(",") |>
  unlist()
giving
[1] "1" "2" "3"
[4] "4" "some text; here" "7"
[7] "8" "9" ";more; pesky; commas;"
[10] "10" "11"
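As mentioned, with a character vector input, dropping the unlist() returns a list with one character vector per string. A small sketch with made-up input strings:

```r
library(gsubfn)

strings <- c("1,2,NONASCIIa, bNONASCII,3", "4,5")
strings |>
  gsubfn("NONASCII(.*?)NONASCII", ~ gsub(",", ";", x), x = _) |>
  strsplit(",")
# a list of two vectors: c("1", "2", "a; b", "3") and c("4", "5")
```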
Here is a base solution; however, it does involve three gsub() calls and a zero-width Perl lookahead regex. It gives the same result as shown above. Use characters other than { and } if those can appear in your data.
The key here is that the start and end markers be different, so first replace the starting NONASCII with { and the ending one with }. Then we can use the indicated regex to replace , within {...} only. Finally, remove the { and } we added. Now we can split on , as before.
my_string |>
  gsub("NONASCII(.*?)NONASCII", "{\\1}", x = _) |>
  gsub(",(?=[^{}]*})", ";", x = _, perl = TRUE) |>
  gsub("[{}]", "", x = _) |>
  strsplit(",") |>
  unlist()
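To see what the zero-width lookahead is doing, here is the middle step in isolation (toy input assumed): ,(?=[^{}]*}) matches a comma only when the next brace character after it is a closing }, i.e. only commas inside {...}.

```r
# only the comma between the braces is followed by "}" with no "{" in between
gsub(",(?=[^{}]*})", ";", "a,b,{c,d},e", perl = TRUE)
# "a,b,{c;d},e"
```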
Input as given in question:
my_string <- "1,2,3,4,NONASCIIsome text, hereNONASCII,7,8,9,NONASCII,more, pesky, commas,NONASCII,10,11"
Upvotes: 2
Reputation: 34441
You can replace 'NONASCII' with quotes, scan the data in, and replace the commas with semicolons:
my_string |>
  gsub("NONASCII", "'", x = _) |>
  scan(text = _, sep = ",", what = character(), quiet = TRUE) |>
  gsub(",", ";", x = _)
[1] "1" "2" "3"
[4] "4" "some text; here" "7"
[7] "8" "9" ";more; pesky; commas;"
[10] "10" "11"
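This works because scan()'s default quote argument for delimited input includes both single and double quotes, so the single quotes substituted for the markers group the embedded commas into one field. A minimal check on a toy string:

```r
# single quotes are treated as quoting characters by scan() when sep is given
scan(text = "1,'a, b',2", sep = ",", what = character(), quiet = TRUE)
# c("1", "a, b", "2")
```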
Upvotes: 4