Reputation: 145
I have names of some 7 countries which is stored somewhere like:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Now, I have to find out using r if a given sentence has these words. Sometimes the name of a country is hiding in the consecutive letters within a sentence. for ex:
You all must pay it bac**k, or ea**ch of you will be in trouble.
If this sentence is passed it should return "korea"
I have tried:
grep('You|all|must|pay|it|back|or|each|of|you|will|be|in|trouble',Random, value = TRUE,ignore.case=TRUE,
fixed = FALSE)
it should return korea
but it's not working. Perhaps I should not use Partial Matching, but i dont have much knowledge regarding it.
Any help is appreciated.
Upvotes: 2
Views: 1922
Reputation: 99331
You can use the handy stringr
library for this. First, remove all the punctuation and spaces from your sentence that we want to match.
> library(stringr)
> txt <- "You all must pay it back, or each of you will be in trouble."
> g <- gsub("[^a-z]", "", tolower(txt))
# [1] "Youallmustpayitbackoreachofyouwillbeintrouble"
Then we can use str_detect
to find the matches.
> Random[str_detect(g, Random)]
# [1] "korea"
Basically you're just looking for a sub-string within a sentence, so collapsing the sentence first seems like a good way to go. Alternatively, you could use str_locate
with str_sub
to find the relevant sub-strings.
> no <- na.omit(str_locate(g, Random))
> str_sub(g, no[,1], no[,2])
# [1] "korea"
Edit Here's one more I came up with
> Random[Vectorize(grepl)(Random, g)]
# [1] "korea"
Upvotes: 2
Reputation: 887118
You could use stringi
which is faster for these operations
library(stringi)
Random[stri_detect_regex(gsub("[^A-Za-z]", "", txt), Random)]
#[1] "korea"
#data
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."
Upvotes: 1
Reputation: 24545
Try:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."
tt <- gsub("[[:punct:]]|\\s+", "", txt)
unlist(sapply(Random, function(r) grep(r, tt)))
korea
1
Upvotes: 0
Reputation: 745
Using base functions only:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Random2=paste(Random,collapse="|") #creating pattern for match
text="bac**k, or ea**ch of you will be in trouble."
text2=gsub("[[:punct:][:space:]]","",text,perl=T) #removing punctuations and space characters
regmatches(text2,gregexpr(Random2,text2))
[[1]]
[1] "korea"
Upvotes: 1