I have this kind of data:
library(dplyr)
glimpse(samp)
Observations: 10
Variables: 2
$ text <chr> "@VirginAmerica What @dhepburn said.", "@VirginAmerica plus you've ...
$ airline_sentiment <chr> "neutral", "positive", "neutral", "negative", "negative", "negative...
I want to compare the words in the text variable with the words in a lexicon, i.e. I want to count how often each word from the lexicon appears in the text.
The lexicon looks like this:
library(lexicon)
hash_sentiment_sentiword[1:5]
             x     y
1:    365 days -0.50
2:    366 days  0.25
3:         3tc -0.25
4:  a fortiori  0.25
5: a good deal  0.25
I know there are functions like str_detect. However, those only give me TRUE/FALSE values.
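To show what I mean (a small sketch with stringr; the example string is made up):
library(stringr)
# str_detect() only tells me *whether* a word occurs ...
str_detect("amazingly awesome, so awesome", "awesome")
# [1] TRUE
# ... while str_count() at least counts matches per string
str_count("amazingly awesome, so awesome", "awesome")
# [1] 2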
The result should look like this (pseudocode):
  text    x       y      n
1 word 1  word 1  score  2
2 word 2  word 2  score  1
3 word 3  word 3  score  10
4 word 4  word 4  score  0
5 word 5  word 5  score  0
...
text: a word from the text column of samp
x and y: the x and y columns from hash_sentiment_sentiword
n: the frequency with which a word of x appears in the text (see the sketch below). For example, the word "awesome" is in x and appears one time in the text, so its n would be 1. "country" is in the text but not in x, so its n would be 0.
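Per lexicon word, I am essentially after something like this (a sketch with stringr; "awesome" stands in for any word of x):
library(stringr)
# total occurrences of one lexicon word across all tweets (the "n" above);
# I would need this for every word in hash_sentiment_sentiword$x
sum(str_count(samp$text, "\\bawesome\\b"))
# [1] 1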
Here is a small dput():
dput(samp)
structure(list(text = c("@VirginAmerica Thanks!", "@VirginAmerica SFO-PDX schedule is still MIA.",
"@VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo",
"@VirginAmerica I flew from NYC to SFO last week and couldn't fully sit in my seat due to two large gentleman on either side of me. HELP!",
"I ❤️ flying @VirginAmerica. ☺️👍",
"@VirginAmerica you know what would be amazingly awesome? BOS-FLL PLEASE!!!!!!! I want to fly with only you."
), airline_sentiment = c("positive", "negative", "positive",
"negative", "positive", "positive")), row.names = 15:20, class = "data.frame")
Here is a base R solution:
# create a vector of all the words in samp$text
# optional: use regex to remove punctuation symbols (this can be refined)
textWords <- unlist(strsplit(gsub('[[:punct:]]', '', samp$text, perl = TRUE), ' '))
# count occurrences of each word and store them as a data frame
occurrences <- unique(data.frame(text = textWords,
                                 n = as.integer(ave(textWords, textWords, FUN = length)),
                                 stringsAsFactors = FALSE))
# get words of x with scores y
xWordsList <- setNames(strsplit(lexicon::hash_sentiment_sentiword$x, ' '),
                       lexicon::hash_sentiment_sentiword$y)
# create the result data frame
res <- data.frame(x = unlist(xWordsList),
                  y = rep(names(xWordsList), lengths(xWordsList)),
                  stringsAsFactors = FALSE)
rm(xWordsList) # remove it, as the object is rather large and no longer needed
# subset to keep only x elements which occur in the text
res <- res[res$x %in% textWords, ]
# look up the occurrence count for each remaining x
res$n <- vapply(seq_len(nrow(res)),
                function(k) occurrences$n[occurrences$text == res$x[k]],
                integer(1))
rownames(res) <- seq_len(nrow(res))
# a glimpse at the result
head(res)
# x y n
# 1 great 0.3125 1
# 2 in -0.125 1
# 3 about 0.25 1
# 4 of 0.125 1
# 5 of -0.125 1
# 6 to 0.125 4
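As an aside, the counting step could equivalently be done with table() (a sketch, not part of the solution above; note that table() sorts alphabetically, while the ave() approach keeps first-appearance order):
# equivalent word counts via table(); rename the columns to match 'occurrences'
occurrences2 <- as.data.frame(table(textWords), stringsAsFactors = FALSE)
names(occurrences2) <- c("text", "n")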
This can be enhanced here and there (e.g. via .subset2 or by refining the regex). Also note that I omitted the column text in res, as that column is by definition identical to the column x.
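For instance, a refined tokenizer could look like this (a sketch; lowercasing makes words like "Thanks" match the all-lowercase lexicon, which would change the result shown above, and keeping apostrophes leaves "I've" as one token):
# lowercase, drop everything except letters, digits, apostrophes and spaces,
# then split on runs of whitespace
textWords <- unlist(strsplit(gsub("[^[:alnum:]' ]", '', tolower(samp$text), perl = TRUE), ' +'))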
One way of doing this (and there are as many ways as there are text-mining packages) is using tidytext. I chose tidytext because you are already using dplyr and the two play nicely together. I'm using an inner_join to join the lexicon with your data; change this to a left_join if you want to keep the words that have no match in the lexicon (see the sketch after the output below).
library(tidytext)
library(dplyr)

samp %>%
  unnest_tokens(text, output = "words", token = "tweets") %>%
  inner_join(lexicon::hash_sentiment_sentiword, by = c("words" = "x")) %>%
  group_by(words, y) %>%
  summarise(n = n())
# A tibble: 20 x 3
# Groups:   words [?]
   words          y     n
   <chr>      <dbl> <int>
 1 about      0.25      1
 2 amazingly  0.125     1
 3 cross     -0.75      1
 4 due        0.25      1
 5 excited    0         1
 6 first      0.375     1
 7 fly       -0.5       1
 8 fully      0.375     1
 9 help       0.208     1
10 know       0.188     1
11 large     -0.25      1
12 last      -0.208     1
13 lax       -0.375     1
14 on         0.125     1
15 please     0.125     1
16 side      -0.125     1
17 still      -0.107     1
18 thanks     0         1
19 virgin     0.25      1
20 want       0.125     1
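For completeness, the left_join variant mentioned above would look like this (a sketch; words without a lexicon match are kept, with y = NA):
samp %>%
  unnest_tokens(text, output = "words", token = "tweets") %>%
  # left_join keeps all tweet words; y is NA where the lexicon has no entry
  left_join(lexicon::hash_sentiment_sentiword, by = c("words" = "x")) %>%
  group_by(words, y) %>%
  summarise(n = n())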
Extra info for tidytext: Tidy Text Mining with R
CRAN task view: Natural Language Processing
Other packages: quanteda, qdap, sentimentr, udpipe