Reputation: 445
I have a dataset containing a column of strings from which I wish to calculate an overall sentiment score, and a data frame containing all the unique words that appear in all the strings , each of which is assigned a score:
library(stringr)
df <- data.frame(text = c('recommend good value no problem','terrible quality no good','good service excellent quality commend'), score = 0)
words <- c('recommend','good','value','problem','terrible','quality','service','excellent','commend')
scores <- c(1,2,1,-2,-3,1,0,3,1)
wordsdf <- data.frame(words,scores)
The only way I have been able to get close to this is by using a nested for loop and the str_count function from the stringr package:
for (i in 1:3){
count = 0
for (j in 1:9){
count <- count + (str_count(df$text[i],as.character(wordsdf$words[j])) * wordsdf$scores[j])
}
df$score[i] <- count
}
This almost achieves what I want:
text score
1 recommend good value no problem 3
2 terrible quality no good 0
3 good service excellent quality commend 7
However, since the word 'commend' is also contained in the word 'recommend', my code calculates the scores as if both words are contained in the string.
So I have two queries: 1 - Is there a way to get it to match only to exact words? 2 - Is there a way to achieve this without using the nested loop?
Upvotes: 1
Views: 449
Reputation: 39858
One tidyverse
possibility could be:
df %>%
rowid_to_column() %>%
mutate(text = strsplit(text, " ", fixed = TRUE)) %>%
unnest() %>%
full_join(wordsdf, by = c("text" = "words")) %>%
group_by(rowid) %>%
summarise(text = paste(text, collapse = " "),
scores = sum(scores, na.rm = TRUE)) %>%
ungroup() %>%
select(-rowid)
text scores
<chr> <dbl>
1 recommend good value no problem 2
2 terrible quality no good 0
3 good service excellent quality commend 7
It, first, splits the "text" column into separate words. Second, it performs a full join on these words. Finally, it combines the words from "text" column again and performs the summation.
Upvotes: 3