Balamurugan Annamalai

Reputation: 43

Regex in R to extract words before a special character

I have a data frame of part-of-speech-tagged strings. Example:

best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ 

I want to remove each tag together with the '_' that precedes it, so that the output is:

best phone only issue camera sensor have mind own

I am using R and couldn't find an appropriate regex for the gsub function. I tried this:

sentence <- c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
o1 <- gsub("\\_.*", "", sentence, perl = TRUE)

But this removes everything after the first underscore. Thanks in advance.

Upvotes: 1

Views: 1469

Answers (1)

Wiktor Stribiżew

Reputation: 626845

You may use the _[A-Z]+ TRE pattern with gsub:

sentence <- c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
gsub("_[A-Z]+","",sentence)
[1] "best phone only issue camera sensor have mind own"

See the R demo

The _[A-Z]+ pattern matches an underscore (_, note it does not have to be escaped in a regex pattern) and one or more (+) uppercase ASCII letters ([A-Z]).
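Since the question mentions a data frame, here is a minimal sketch of applying the same pattern to a whole column. The data frame and the column names tagged and clean are hypothetical, made up for illustration; gsub is vectorized over its input, so no loop is needed.

```r
# Hypothetical data frame; the column names `tagged` and `clean` are assumptions.
df <- data.frame(
  tagged = c("best_JJS phone_NN only_RB", "camera_NN sensor_NN have_VB"),
  stringsAsFactors = FALSE
)

# gsub() processes every element of the character vector,
# so the whole column is cleaned in one call.
df$clean <- gsub("_[A-Z]+", "", df$tagged)

print(df$clean)
# [1] "best phone only"    "camera sensor have"
```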

You may make the pattern more precise, say, to match the _ only if it is preceded by a word char and to match the uppercase letters only when they are followed by a word boundary:

"\\B_[A-Z]+\\b"
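To see what the extra anchors change, here is a small comparison on a made-up input (the string x is hypothetical, not from the question). This sketch passes perl = TRUE so that \\B and \\b follow PCRE semantics; that is a choice made for this example, not something the pattern requires.

```r
# Hypothetical input: one normal tag, one tag with no word before it,
# and one tag followed directly by a lowercase letter.
x <- "foo_NN _XX bar_VBx"

# Plain pattern: removes every underscore followed by uppercase letters.
loose <- gsub("_[A-Z]+", "", x)

# Anchored pattern: the underscore must follow a word char (\\B),
# and the uppercase run must end at a word boundary (\\b).
strict <- gsub("\\B_[A-Z]+\\b", "", x, perl = TRUE)

print(loose)   # "foo  barx"
print(strict)  # "foo _XX bar_VBx"
```

The anchored version leaves _XX and _VBx alone, because the first has no word char before the underscore and the second has no word boundary after the uppercase letters.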

In case you want to create a very specific regex for the POS values, you may use alternation:

"\\B_(JJ|NN|CC|[VR]B)\\b"

Then keep adding |<code> alternatives inside the group for each further POS code you need.
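Note that the alternation as written does not list JJS, so best_JJS would be left untouched: _JJ matches, but the trailing \\b fails before the S. One way to keep the list maintainable is to build the pattern from a vector of tags. The sketch below assumes the tag set seen in the sample sentence; the variable names tags, pattern and cleaned are illustrative, and perl = TRUE is again a choice made for this example.

```r
# Assumed tag set, taken from the sample sentence in the question.
tags <- c("JJS", "JJ", "NN", "RB", "VB")

# Build "\\B_(JJS|JJ|NN|RB|VB)\\b" from the vector.
pattern <- paste0("\\B_(", paste(tags, collapse = "|"), ")\\b")

sentence <- "best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ"
cleaned <- gsub(pattern, "", sentence, perl = TRUE)

print(cleaned)
# [1] "best phone only issue camera sensor have mind own"
```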

Upvotes: 1
