I have this kind of data:
library(dplyr)
glimpse(samp)
Observations: 10
Variables: 2
$ text <chr> "@VirginAmerica What @dhepburn said.", "@VirginAmerica plus you've ...
$ airline_sentiment <chr> "neutral", "positive", "neutral", "negative", "negative", "negative...
I want to compare the words in the text variable with the words in a lexicon, i.e. I want to count how often each word from the lexicon appears in the text.
The lexicon looks like this:
library(lexicon)
hash_sentiment_sentiword[1:5]
             x     y
1:    365 days -0.50
2:    366 days  0.25
3:         3tc -0.25
4:  a fortiori  0.25
5: a good deal  0.25
I know there are functions like str_detect. However, those only give me TRUE/FALSE values.
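To show what I mean (a small sketch with stringr; the example string is made up):
library(stringr)
# str_detect() only tells me *whether* a word occurs ...
str_detect("amazingly awesome, so awesome", "awesome")
# [1] TRUE
# ... while str_count() at least counts matches per string
str_count("amazingly awesome, so awesome", "awesome")
# [1] 2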
The result should look like this (pseudocode):
  text    x       y      n
1 word 1  word 1  score  2
2 word 2  word 2  score  1
3 word 3  word 3  score  10
4 word 4  word 4  score  0
5 word 5  word 5  score  0
...
text: a word from the text column of samp
x and y: the x and y columns from hash_sentiment_sentiword
n: the frequency with which a word of x appears in the text (see the sketch below). For example, the word "awesome" is in x and appears one time in the text, so its n would be 1. "country" is in the text but not in x, so its n would be 0.
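Per lexicon word, I am essentially after something like this (a sketch with stringr; "awesome" stands in for any word of x):
library(stringr)
# total occurrences of one lexicon word across all tweets (the "n" above);
# I would need this for every word in hash_sentiment_sentiword$x
sum(str_count(samp$text, "\\bawesome\\b"))
# [1] 1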
Here is a small dput():
dput(samp)
structure(list(text = c("@VirginAmerica Thanks!", "@VirginAmerica SFO-PDX schedule is still MIA.",
"@VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo",
"@VirginAmerica I flew from NYC to SFO last week and couldn't fully sit in my seat due to two large gentleman on either side of me. HELP!",
"I ❤️ flying @VirginAmerica. ☺️👍",
"@VirginAmerica you know what would be amazingly awesome? BOS-FLL PLEASE!!!!!!! I want to fly with only you."
), airline_sentiment = c("positive", "negative", "positive",
"negative", "positive", "positive")), row.names = 15:20, class = "data.frame")
Here is a base R solution:
# create a vector of all the words in samp$text
# optional: use regex to remove punctuation symbols (this can be refined)
textWords <- unlist(strsplit(gsub('[[:punct:]]', '', samp$text, perl = TRUE), ' '))
# count occurrences of each word and store them as a data frame
occurrences <- unique(data.frame(text = textWords,
                                 n = as.integer(ave(textWords, textWords, FUN = length)),
                                 stringsAsFactors = FALSE))
# get words of x with scores y
xWordsList <- setNames(strsplit(lexicon::hash_sentiment_sentiword$x, ' '),
                       lexicon::hash_sentiment_sentiword$y)
# create the result data frame
res <- data.frame(x = unlist(xWordsList),
                  y = rep(names(xWordsList), lengths(xWordsList)),
                  stringsAsFactors = FALSE)
rm(xWordsList) # remove it, as the object is rather large and no longer needed
# subset to keep only x elements which occur in the text
res <- res[res$x %in% textWords, ]
# look up the occurrence count for each remaining x
res$n <- vapply(seq_len(nrow(res)),
                function(k) occurrences$n[occurrences$text == res$x[k]],
                integer(1))
rownames(res) <- seq_len(nrow(res))
# a glimpse at the result
head(res)
# x y n
# 1 great 0.3125 1
# 2 in -0.125 1
# 3 about 0.25 1
# 4 of 0.125 1
# 5 of -0.125 1
# 6 to 0.125 4
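As an aside, the counting step could equivalently be done with table() (a sketch, not part of the solution above; note that table() sorts alphabetically, while the ave() approach keeps first-appearance order):
# equivalent word counts via table(); rename the columns to match 'occurrences'
occurrences2 <- as.data.frame(table(textWords), stringsAsFactors = FALSE)
names(occurrences2) <- c("text", "n")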
This can be enhanced here and there (e.g. via .subset2 or by refining the regex). Also note that I omitted the column text in res, as that column is by definition identical to the column x.
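For instance, a refined tokenizer could look like this (a sketch; lowercasing makes words like "Thanks" match the all-lowercase lexicon, which would change the result shown above, and keeping apostrophes leaves "I've" as one token):
# lowercase, drop everything except letters, digits, apostrophes and spaces,
# then split on runs of whitespace
textWords <- unlist(strsplit(gsub("[^[:alnum:]' ]", '', tolower(samp$text), perl = TRUE), ' +'))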
One way of doing this (and there are as many ways as there are text-mining packages) is using tidytext. I chose tidytext because you are already using dplyr and the two play nicely together. I'm using an inner_join to join the lexicon with your data; change this to a left_join if you want to keep the words that have no match in the lexicon (see the sketch after the output below).
library(tidytext)
library(dplyr)

samp %>%
  unnest_tokens(text, output = "words", token = "tweets") %>%
  inner_join(lexicon::hash_sentiment_sentiword, by = c("words" = "x")) %>%
  group_by(words, y) %>%
  summarise(n = n())
# A tibble: 20 x 3
# Groups:   words [?]
   words          y     n
   <chr>      <dbl> <int>
 1 about      0.25      1
 2 amazingly  0.125     1
 3 cross     -0.75      1
 4 due        0.25      1
 5 excited    0         1
 6 first      0.375     1
 7 fly       -0.5       1
 8 fully      0.375     1
 9 help       0.208     1
10 know       0.188     1
11 large     -0.25      1
12 last      -0.208     1
13 lax       -0.375     1
14 on         0.125     1
15 please     0.125     1
16 side      -0.125     1
17 still      -0.107     1
18 thanks     0         1
19 virgin     0.25      1
20 want       0.125     1
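For completeness, the left_join variant mentioned above would look like this (a sketch; words without a lexicon match are kept, with y = NA):
samp %>%
  unnest_tokens(text, output = "words", token = "tweets") %>%
  # left_join keeps all tweet words; y is NA where the lexicon has no entry
  left_join(lexicon::hash_sentiment_sentiword, by = c("words" = "x")) %>%
  group_by(words, y) %>%
  summarise(n = n())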
Extra info for tidytext: Tidy Text Mining with R
CRAN task view: Natural Language Processing
Other packages: quanteda, qdap, sentimentr, udpipe