R: Perl Regex for unicode character string

Question

I am trying to get rid of some unicode character strings spread out in my data.

Sample data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"

I want to capture everything starting with a u'\ and include the comma at the end.

I was thinking of starting with:

gsub("u/\/\'....

+ everything including the next comma, but I'm not sure how to say that second part.

For a result of:

Sample data <- "['oguma', 'makeup', 'jeban',]"

Any suggestions?

Wiktor Stribiżew · Accepted Answer

Here is a regex solution that removes each substring that starts with u', continues with one or more non-ASCII characters and a closing ', and ends with an optional comma and optional trailing whitespace:

data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"
# The backslash must be doubled inside an R string literal, so \s is written as \\s
gsub("u'[^[:ascii:]]+',?\\s*", "", data, perl=TRUE)
## => [1] "['oguma', 'makeup', 'jeban',]"

See IDEONE demo
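To check exactly which substring the pattern removes before deleting anything, the match can be extracted first. This is only a sketch: the shortened sample string and the variable names `pattern`, `matched`, and `cleaned` are illustrative and not part of the original answer.

```r
# Shortened sample input (illustrative); the \uXXXX escapes become real characters
data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19', 'jeban',]"
pattern <- "u'[^[:ascii:]]+',?\\s*"

# gregexpr() locates every match; regmatches() extracts the matched text
matched <- regmatches(data, gregexpr(pattern, data, perl = TRUE))[[1]]
print(matched)

# Remove the matched substrings
cleaned <- gsub(pattern, "", data, perl = TRUE)
print(cleaned)
```

Here `matched` should hold a single element: the u'...' entry together with its trailing comma and space.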

Note that the \u0e27-like sequences in your example are not literal backslashes followed by u: R parses each escape in the string literal into a single non-ASCII character, which is why [^[:ascii:]] matches them. If you print the string, they are displayed as the actual letters (here, u'วิตามินหน้าเด็ก', Thai for "vitamins for kids" according to Google Translate).
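The distinction between real characters and literal escape text can be checked directly. The sketch below also covers a hypothetical second case that is not in the question: data in which the escapes survive as literal backslash-u text (for example, if they were read from a file). The names `parsed`, `literal`, and `literal_pattern` are made up for illustration.

```r
# Case 1: typed as an R string literal, so each \uXXXX is already one character
parsed <- "u'\u0e27\u0e34'"
has_backslash_u <- grepl("\\u", parsed, fixed = TRUE)       # looks for a literal \u
has_non_ascii <- grepl("[^[:ascii:]]", parsed, perl = TRUE)
print(c(has_backslash_u, has_non_ascii))

# Case 2 (hypothetical): the escapes are stored as literal text.
# Doubling the backslashes in the literal keeps them as plain characters.
literal <- "['oguma', u'\\u0e27\\u0e34', 'jeban',]"

# \\\\u in the R literal is the regex \\u, i.e. a literal backslash followed by u
literal_pattern <- "u'(?:\\\\u[0-9a-fA-F]{4})+',?\\s*"
cleaned_literal <- gsub(literal_pattern, "", literal, perl = TRUE)
print(cleaned_literal)
```

If the first check finds no backslash-u text, the [^[:ascii:]] pattern from the answer is the one to use; the second pattern only applies when the escapes were never parsed.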
