lmcshane

Reputation: 1114

R: Perl Regex for unicode character string

I am trying to get rid of some unicode character strings spread out in my data.

Sample data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"

I want to match everything starting with u'\ up to and including the comma at the end.

I was thinking of starting with:

gsub("u/\\/\'....

followed by everything up to and including the next comma, but I'm not sure how to express that second part.

For a result of:

Sample data <- "['oguma', 'makeup', 'jeban',]"

Any suggestions?

Upvotes: 1

Views: 230

Answers (1)

Wiktor Stribiżew

Reputation: 627083

Here is a regex solution that removes each substring that starts with u', continues with one or more non-ASCII characters and a closing ', and ends with an optional comma and optional trailing whitespace:

data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"
gsub("u'[^[:ascii:]]+',?\\s*", "", data, perl = TRUE)
## => [1] "['oguma', 'makeup', 'jeban',]"

See IDEONE demo
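Since gsub() is vectorised, the same pattern should also work on a whole character vector or data frame column. A minimal sketch, assuming the data is held in a character vector (the name `tags` and its contents are made up for illustration):

```r
# Hypothetical input: one string with a u'...' entry, one without
tags <- c(
  "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32', 'jeban',]",
  "['plain', 'ascii',]"
)

# Same pattern as above: u', 1+ non-ASCII chars, closing ',
# then an optional comma and optional whitespace
pattern <- "u'[^[:ascii:]]+',?\\s*"

# gsub() processes every element; strings without a match are returned unchanged
cleaned <- gsub(pattern, "", tags, perl = TRUE)
print(cleaned)
```

The second element has no u'...' entry, so it passes through untouched.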

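Note that the \u0e27-like sequences in your example are escape sequences, not literal backslashes: R converts each one to a single non-ASCII character when it parses the string literal, which is why the pattern looks for non-ASCII characters rather than for \u text. If you print the string in a UTF-8 locale, they are displayed as the actual letters (here, u'วิตามินหน้าเด็ก', Thai for "vitamins for kids" according to Google Translate).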
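To check which situation you are in, you can inspect the string directly. The sketch below first confirms that a parsed literal holds real characters, then covers the other case, under the assumption that your data contains literal backslash-u text (for example because it was read verbatim from a file). The variable names and the second pattern are illustrative, not part of the original solution:

```r
# Escapes in an R string literal become real characters at parse time
parsed <- "u'\u0e27\u0e34'"
nchar(parsed)                        # counts characters, not escape text
grepl("\\u", parsed, fixed = TRUE)   # is there a literal backslash-u?

# Assumed alternative case: the text really contains backslash-u sequences
literal <- "['oguma', u'\\u0e27\\u0e34', 'jeban',]"

# Match u', one or more \uXXXX groups (4 hex digits each), closing ',
# then an optional comma and optional whitespace
literal_pattern <- "u'(\\\\u[0-9a-fA-F]{4})+',?\\s*"
cleaned_literal <- gsub(literal_pattern, "", literal, perl = TRUE)
print(cleaned_literal)
```

If the fixed-string check finds a backslash-u in your real data, the second pattern is the one to try; otherwise the non-ASCII pattern applies.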

Upvotes: 1
