1089

Reputation: 145

Consecutive string matching in a sentence using R

I have the names of 7 countries stored in a vector like this:

Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')

Now I have to find out, using R, whether a given sentence contains any of these words. Sometimes the name of a country is hidden in consecutive letters that span several words of the sentence. For example:

You all must pay it bac**k, or ea**ch of you will be in trouble.

If this sentence is passed it should return "korea"

I have tried:

grep('You|all|must|pay|it|back|or|each|of|you|will|be|in|trouble',Random, value = TRUE,ignore.case=TRUE,
 fixed = FALSE)

It should return "korea", but it's not working. Perhaps I should not use partial matching, but I don't know much about it.

Any help is appreciated.

Upvotes: 2

Views: 1922

Answers (4)

Rich Scriven

Reputation: 99331

You can use the handy stringr library for this. First, lowercase the sentence and remove everything that is not a letter (punctuation and spaces), so the words run together.

> library(stringr)
> txt <- "You all must pay it back, or each of you will be in trouble."
> g <- gsub("[^a-z]", "", tolower(txt))
> g
# [1] "youallmustpayitbackoreachofyouwillbeintrouble"

Then we can use str_detect to find the matches.

> Random[str_detect(g, Random)]
# [1] "korea"

Basically you're just looking for a sub-string within a sentence, so collapsing the sentence first seems like a good way to go. Alternatively, you could use str_locate with str_sub to find the relevant sub-strings.

> no <- na.omit(str_locate(g, Random))
> str_sub(g, no[,1], no[,2])
# [1] "korea"

Edit: Here's one more I came up with:

> Random[Vectorize(grepl)(Random, g)]
# [1] "korea"
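
The same idea can also be wrapped in a small base R helper with no package dependency. This is only a sketch: `find_hidden` is a hypothetical name (not a stringr or base R function), and it assumes the words to look for consist of lowercase ASCII letters only.

```r
# Hypothetical helper: return the words hidden in a sentence once case,
# spaces and punctuation are ignored. Assumes `words` are lowercase a-z.
find_hidden <- function(sentence, words) {
  # Lowercase first, then drop every character that is not a letter
  collapsed <- gsub("[^a-z]", "", tolower(sentence))
  # fixed = TRUE treats each word as a literal string, not as a regex
  hits <- vapply(words, function(w) grepl(w, collapsed, fixed = TRUE), logical(1))
  words[hits]
}

Random <- c('norway', 'india', 'china', 'korea', 'france', 'japan', 'iran')
txt <- "You all must pay it back, or each of you will be in trouble."

find_hidden(txt, Random)
# [1] "korea"
```

Lowercasing before stripping matters: with the order reversed, `[^a-z]` would delete the capital letters instead of converting them.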

Upvotes: 2

akrun

Reputation: 887118

You could use stringi, which is faster for these operations:

library(stringi)
Random[stri_detect_regex(gsub("[^A-Za-z]", "", txt), Random)]
#[1] "korea"

#data
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')    
txt <- "You all must pay it back, or each of you will be in trouble."

Upvotes: 1

rnso

Reputation: 24545

Try:

Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')    
txt <- "You all must pay it back, or each of you will be in trouble."

tt  <- gsub("[[:punct:]]|\\s+", "", txt)

unlist(sapply(Random, function(r) grep(r, tt)))
korea 
    1 

Upvotes: 0

sidpat

Reputation: 745

Using base functions only:

Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Random2 <- paste(Random, collapse = "|")     # alternation pattern: "norway|india|...|iran"

text <- "bac**k, or ea**ch of you will be in trouble."
text2 <- gsub("[[:punct:][:space:]]", "", text, perl = TRUE)  # remove punctuation and whitespace

regmatches(text2,gregexpr(Random2,text2))
[[1]]
[1] "korea"
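
One caveat with a single alternation pattern: gregexpr reports non-overlapping matches only, so two names that share letters are not both found. A small sketch of the difference, using a made-up input ("japanorway" is an illustrative string, not from the question) and comparing against a check of each name separately:

```r
Random <- c('norway', 'india', 'china', 'korea', 'france', 'japan', 'iran')

# Illustrative input: "japan" and "norway" share the letter "n"
text2 <- "japanorway"

# Alternation: after matching "japan", the search resumes at "orway"
alt <- regmatches(text2, gregexpr(paste(Random, collapse = "|"), text2))[[1]]
alt
# [1] "japan"

# Checking each name on its own finds both
per_word <- Random[vapply(Random, grepl, logical(1), x = text2, fixed = TRUE)]
per_word
# [1] "norway" "japan"
```

If overlapping names matter for your data, the per-name check is the safer choice; otherwise the alternation is more compact.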

Upvotes: 1
