Mehrdad Rohani

Reputation: 181

string matching in R

I have four words: wordA, wordB, wordX and wordY. I have a data set with a single column (message), and the message column is a factor. For each row, I want to count the occurrences of wordA and wordB, subtract the occurrences of wordX and wordY, and put the result in a new column of that row.

For example, if the message is "wordD wordA wordX wordA wordC wordA wordB wordY", the value should be wordA - wordX + wordA + wordA + wordB - wordY = 1 - 1 + 1 + 1 + 1 - 1 = +2.

I wrote this code, but it doesn't count repeated words. I would appreciate any help.

for (i in 1:nrow(dataset)) {
  counter <- 0

  # grep() only reports whether the row matches, so each word counts at most once
  if (length(grep("wordA", dataset[i, 1])) == 1) {
    counter <- counter + 1
  }
  if (length(grep("wordB", dataset[i, 1])) == 1) {
    counter <- counter + 1
  }
  if (length(grep("wordX", dataset[i, 1])) == 1) {
    counter <- counter - 1
  }
  if (length(grep("wordY", dataset[i, 1])) == 1) {
    counter <- counter - 1
  }
  dataset[i, 2] <- counter
}

Upvotes: 0

Views: 748

Answers (2)

BartekCh

Reputation: 920

You could also use gregexpr, which finds every occurrence of a given pattern and returns the starting position of each match (or -1 if there is no match).

messages <- c("wordD wordA wordX wordA wordC wordA wordB wordY",
              "wordX wordA wordY wordY wordC wordD wordB wordY",
              "wordB wordA wordX wordA wordB wordA wordB wordY")
# gregexpr() returns -1 when nothing matches, so count only real match positions
count <- function(pattern) sapply(gregexpr(pattern, messages), function(m) sum(m > 0))
score <- count("wordA|wordB") - count("wordX|wordY")
# score is 2 -2 4
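
To tie this back to the question, here is a rough sketch of the same idea applied to a data frame whose message column is a factor, with the result written to a new column. The data frame contents, the helper name count_matches and the column name score are made up for illustration. The \\b word boundaries are an optional extra, assuming you do not want "wordA" to match inside a longer word.

```r
# Hypothetical data frame mirroring the question: one factor column, "message".
dataset <- data.frame(
  message = factor(c("wordD wordA wordX wordA wordC wordA wordB wordY",
                     "wordC wordD",
                     "wordX wordY"))
)

# Count regex matches per element. gregexpr() returns -1 for "no match",
# so only positive start positions are counted.
count_matches <- function(pattern, text) {
  sapply(gregexpr(pattern, text, perl = TRUE), function(m) sum(m > 0))
}

# Convert the factor to plain strings before matching.
msg <- as.character(dataset$message)

# \\b restricts matches to whole words (assumption: substrings should not count).
dataset$score <- count_matches("\\b(wordA|wordB)\\b", msg) -
  count_matches("\\b(wordX|wordY)\\b", msg)

dataset$score
# 2 0 -2
```

The second row contains none of the four words, which is exactly the case where counting with length() alone would report one match per pattern instead of zero.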

Upvotes: 2

Jota

Reputation: 17611

I'm not entirely sure if this is what you're looking for, but here is what I thought you might be asking. You want to score each element of a vector of sentences or phrases (e.g. mess <- c("some stuff here", "some stuff not here", "most stuff here")) according to which words it contains. Each occurrence of some words adds 1 to the score, and each occurrence of other words subtracts 1. In my example, the words that add 1 are "here" and "stuff", and the words that subtract 1 are "some" and "most".

# vector of messages to score
mess <- c("some stuff here", "some stuff not here", "most stuff here")

# split each message into words, then count the words that match each pattern
words <- strsplit(mess, " ")
positiveword <- sapply(words, function(x) sum(grepl("here|stuff", x)))
negativeword <- sapply(words, function(x) sum(grepl("some|most", x)))

score <- positiveword - negativeword
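
One thing to keep in mind: grepl matches substrings, so the pattern "here" would also match the word "there". If that is not what you want, a possible variant (just a sketch, not the only way) compares whole words with %in% instead. The names positive, negative and score_message are mine, and I changed the last message to end in "there" to show the difference.

```r
# Last message ends in "there" (changed from the example above) to show
# that exact matching does not count it as "here".
mess <- c("some stuff here", "some stuff not here", "most stuff there")

positive <- c("here", "stuff")  # each occurrence adds 1
negative <- c("some", "most")   # each occurrence subtracts 1

# Score one message, given as a character vector of its words.
score_message <- function(words) {
  sum(words %in% positive) - sum(words %in% negative)
}

# fixed = TRUE splits on a literal space rather than a regular expression.
score <- sapply(strsplit(mess, " ", fixed = TRUE), score_message)
score
# 1 1 0
```

Because every word is tested separately, repeated words are counted once per occurrence, which is what the question asks for.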

Upvotes: 1
