Reputation: 24665
"CATARACT; #&#22823;&#33151;&#39592;~2010"
I need to pick out the &#22823;&#33151;&#39592; part (it renders as 大腿骨) in R using gsub. Each piece is actually a Unicode character reference that starts with &#, followed by a five-digit number, and ends with ;.
I know how to get rid of these references using the following:
gsub("&#[0-9]+;","","CATARACT; #&#22823;&#33151;&#39592;~2010")
But how can I retain them using gsub?
My desired output is &#22823;&#33151;&#39592;.
Thanks for the answer, but what if the pattern is not always like that? I need to pick up the Unicode references wherever they occur:
"CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"
Upvotes: 0
Views: 70
Reputation: 54237
E.g. using gregexpr and regmatches:
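ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"
# Locate every numeric character reference, wherever it occurs in the string
m <- gregexpr("&#[0-9]+;", ex)
(r <- regmatches(ex, m))
# [[1]]
# [1] "&#22823;" "&#33151;" "&#39592;" "&#22824;" "&#33152;" "&#39593;"
# Join the extracted references back into one string
paste(r[[1]], collapse="")
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;"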
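Since the question asks for gsub specifically, here is a minimal sketch of a gsub-only variant, plus an optional step that turns the references into the characters they stand for. It assumes the input holds literal decimal references such as &#22823; (not already-rendered characters). The variable names kept, codes and decoded are only for illustration and do not come from the posts above.

```r
ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

# Either match a whole reference (captured as group 1) or any single other
# character. Replacing each match with group 1 keeps the references and
# drops everything else, because group 1 is empty for the "other" case.
kept <- gsub("(&#[0-9]+;)|.", "\\1", ex, perl = TRUE)
kept
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;"

# Optional: convert the decimal code points into actual characters.
refs <- regmatches(kept, gregexpr("&#[0-9]+;", kept))[[1]]
codes <- as.integer(gsub("[^0-9]", "", refs))  # strip "&#" and ";"
decoded <- intToUtf8(codes)                    # one string of all characters
```

The alternation trick relies on the reference pattern being tried before the catch-all dot, which is why perl = TRUE is set here. If a list of individual matches is more useful than one string, the gregexpr/regmatches route is probably the cleaner choice.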
Upvotes: 1