Reputation: 24665

Pick up a string with a specific pattern in R using gsub

"CATARACT; #&#22823;&#33151;&#39592;~2010"

I need to pick up the &#22823;&#33151;&#39592; (rendered as 大腿骨) in R using gsub. Each character is actually a Unicode entity that starts with &#, is followed by a five-digit number, and ends with ;.

I know how to get rid of these entities using the following:

gsub("&#[0-9]+;","","CATARACT; #&#22823;&#33151;&#39592;~2010")

But how can I keep only these entities using gsub?

Edit 01

My desired output is &#22823;&#33151;&#39592;.

Edit 02

Thanks for the answer, but what if the pattern is not always like that? I need to pick up the entities no matter where they appear:

"CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

Upvotes: 0

Answers (2)

lukeA

Reputation: 54237

E.g. using gregexpr and regmatches:

ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"
m <- gregexpr("&#[0-9]+;", ex)
(r <- regmatches(ex, m))
# [[1]]
# [1] "&#22823;" "&#33151;" "&#39592;" "&#22824;" "&#33152;" "&#39593;"

paste(r[[1]], collapse="")
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;"

Upvotes: 1

droopy

Reputation: 2818

You can try the following, where x holds the input string. Note that it keeps only the first run of entities:

gsub("(^\\D*)((&#[0-9]+;)+)(.*$)", "\\2", x)
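
For the Edit 02 case, where entities can appear at several positions, one possible gsub-only variant is to match either an entity or any single character, and keep only the captured entity. This is a sketch under the assumption that perl = TRUE is acceptable; the variable names ex and res are illustrative.

```r
ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

# At each position the entity alternative is tried first and kept via \\1.
# Any other character is matched by "." instead, where group 1 is empty,
# so that character is replaced with nothing.
res <- gsub("(&#[0-9]+;)|.", "\\1", ex, perl = TRUE)
print(res)
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;"
```

The order of the alternatives matters here: putting the entity pattern first ensures that a leading & is consumed as part of an entity rather than dropped as a lone character.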

Upvotes: 0
