Reputation: 1114
I am trying to get rid of some Unicode escape strings scattered through my data.
Sample data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"
I want to capture everything starting with u'\ and including the comma at the end.
I was thinking of starting with:
gsub("u/\\/\'....
plus everything up to and including the next comma, but I'm not sure how to express that second part.
For a result of:
Sample data <- "['oguma', 'makeup', 'jeban',]"
Suggestions?
Upvotes: 1
Views: 230
Reputation: 627083
Here is a regex solution that removes the substrings starting with u', followed by one or more non-ASCII characters and a closing ', and ending with an optional comma and any trailing whitespace:
data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"
gsub("u'[^[:ascii:]]+',?\\s*", "", data, perl=T)
## => [1] "['oguma', 'makeup', 'jeban',]"
See IDEONE demo
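If you prefer to avoid perl = TRUE (the [:ascii:] class is only available in PCRE mode), a sketch of an equivalent with base R's default TRE engine is below; it spells out the ASCII range explicitly instead of using the named class. This is an alternative I'm suggesting, not part of the original answer:

```r
data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"

# [^\x01-\x7F] matches any character outside the ASCII range,
# so no perl = TRUE is needed here.
cleaned <- gsub("u'[^\x01-\x7F]+',?\\s*", "", data)
cleaned
## [1] "['oguma', 'makeup', 'jeban',]"
```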
Note that the \u0e27-like substrings in your example are just non-ASCII characters that, if you print the string, will be displayed as those letters/symbols themselves (here, u'วิตามินหน้าเด็ก', Thai for "vitamins for kids" - Google Translate).
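To see this for yourself, a quick check (assuming a UTF-8 locale) shows that R's parser turns each \uXXXX escape into a single character rather than a literal backslash sequence:

```r
# Two \u escapes produce two characters, not twelve literal bytes of text.
s <- "\u0e27\u0e34"
nchar(s)      # 2
cat(s, "\n")  # prints the Thai letters themselves
```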
Upvotes: 1