Rohit Haritash

Reputation: 404

Find a string in a sentence in R

Hi, I am trying to find a short text in a sentence and then do some manipulation. It is easy in Java, but in R I am having some issues: the if condition is never reached. Here is my code:

rm(list=ls())
library(tidytext)
library(dplyr)

shortText= c('grt','gr8','bcz','ur')


tweet=c('stats is gr8','this car is good','your movie is grt','i hate your book of hatred','food is awsome'
        )
tweet=data.frame(tweet, stringsAsFactors = FALSE)

for(row in 1:nrow(tweet)) {

tweetWords=strsplit(tweet[row,]," ")
print(tweetWords)
  for (word in 1:length(tweetWords)) {
    if(tweetWords[word] %in% shortText){
      print('we have a match')
    }

  }
}

Upvotes: 1

Views: 1085

Answers (3)

utubun

Reputation: 4505

Could it be something like this:

cbind(tweet, ifelse(sapply(shortText, grepl, x = tweet), "Match is found", "No match"))

             tweet                        grt              gr8              bcz       
    [1,] "stats is gr8"               "No match"       "Match is found" "No match"
    [2,] "this car is good"           "No match"       "No match"       "No match"
    [3,] "your movie is grt"          "Match is found" "No match"       "No match"
    [4,] "i hate your book of hatred" "No match"       "No match"       "No match"
    [5,] "food is awsome"             "No match"       "No match"       "No match"
     ur              
    [1,] "No match"      
    [2,] "No match"      
    [3,] "Match is found"
    [4,] "Match is found"
    [5,] "No match"

Upvotes: 0

Tim Biegeleisen

Reputation: 522712

Here is a straightforward base R option using grepl:

shortText <- c('grt','gr8','bcz','ur')
tweet <- c('stats is gr8','this car is good','your movie is grt','i hate your book of hatred','food is awsome')

res <- sapply(shortText, function(x) grepl(paste0("\\b", x, "\\b"), tweet))
tweet[rowSums(res) > 0]

[1] "stats is gr8"      "your movie is grt"


The basic idea is to generate a logical matrix whose rows are the tweets and whose columns are the keywords. If a row contains one or more TRUE values, that tweet matched at least one keyword.

Note carefully that I surround each search term with word boundaries (\b). This is necessary so that a search term does not falsely match as a substring of a larger word, for example ur inside your.
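To see what the word boundaries change, here is a small sketch that runs both patterns on the same data. The names plain and bounded are only illustrative and are not part of the answer above.

```r
shortText <- c('grt', 'gr8', 'bcz', 'ur')
tweet <- c('stats is gr8', 'this car is good', 'your movie is grt',
           'i hate your book of hatred', 'food is awsome')

# Plain substring search: 'ur' also matches inside 'your'
plain <- sapply(shortText, function(x) grepl(x, tweet, fixed = TRUE))

# Whole-word search: each term is wrapped in \\b word boundaries
bounded <- sapply(shortText, function(x) grepl(paste0("\\b", x, "\\b"), tweet))

# Compare the 'ur' column of both matrices
plain[, "ur"]
bounded[, "ur"]
```

With the plain search, the tweets containing your are flagged for ur; with the word boundaries, none of the tweets are.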

Upvotes: 1

drJones

Reputation: 1233

There are many ways to improve this, but here is a quick solution with minimal changes to your code:

shortText= c('grt','gr8','bcz','ur')


tweet=c('stats is gr8','this car is good','your movie is grt','i hate your book of hatred','food is awsome'
)
tweet=data.frame(tweet, stringsAsFactors = FALSE)

for(row in 1:nrow(tweet)) {

  tweetWords=strsplit(tweet[row,]," ")
  print(tweetWords)
  for (word in 1:length(tweetWords)) {
    if(any(tweetWords[word][[1]] %in% shortText)){
      print('we have a match')
    }

  }
}

returns:

[[1]]
[1] "stats" "is"    "gr8"  

[1] "we have a match"
[[1]]
[1] "this" "car"  "is"   "good"

[[1]]
[1] "your"  "movie" "is"    "grt"  

[1] "we have a match"
[[1]]
[1] "i"      "hate"   "your"   "book"   "of"     "hatred"

[[1]]
[1] "food"   "is"     "awsome"

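Adding any() makes the if statement run when at least one of the logical values is TRUE. The [[1]] is needed because strsplit() returns a list, so tweetWords[word] is still a list and [[1]] extracts the character vector of words. Without any(), if would have used only the first element of the comparison; since R 4.2.0, a condition of length greater than one is an error.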
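As one of the possible improvements, the same word-by-word check can be written without explicit loops. This is only a sketch: it assumes tweet is a plain character vector rather than a data frame, and the name hasMatch is made up for the example.

```r
shortText <- c('grt', 'gr8', 'bcz', 'ur')
tweet <- c('stats is gr8', 'this car is good', 'your movie is grt',
           'i hate your book of hatred', 'food is awsome')

# strsplit() returns a list with one character vector of words per tweet
tweetWords <- strsplit(tweet, " ", fixed = TRUE)

# For each tweet, check whether any of its words appears in shortText
hasMatch <- vapply(tweetWords,
                   function(words) any(words %in% shortText),
                   logical(1))

# Keep only the tweets that contain at least one short text
tweet[hasMatch]
```

Because the comparison is done on whole words, your is not treated as a match for ur, and the result contains the gr8 and grt tweets only.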

Upvotes: 0
