Reputation: 83
I have built a new lexicon dictionary to analyse the sentiment of sentences in R. I have used existing lexicon dictionaries in R before, but I am unsure how to use my own. I managed to create lists of positive and negative words and count how many of each appear in a sentence, then sum the counts. This does not take into account the score allocated to each word, as shown in the example below.
I would like to analyse, say, this sentence: "I am happy and kind of sad". Example list of words and scores (the real list would be bigger than this):
happy, 1.3455
sad, -1.0552
I would like to match these words with the sentence and take the sum of the scores, 1.3455 + -1.0552, which in this case gives an overall score of 0.2903.
How would I go about using the actual score of each word to get an overall score for each sentence in R, as in the example above?
Many thanks, James
Upvotes: 1
Views: 1474
Reputation: 9485
You can start with the magnificent tidytext package:
library(tidytext)
library(tidyverse)
First, your data to analyze, and a small transformation:
# data
df <- tibble(text = c('I am happy and kind of sad', 'sad is sad, happy is good'))
# add an ID column
df <- tibble::rowid_to_column(df, "ID")
# rename the ID column to "line"
colnames(df)[1] <- "line"
> df
# A tibble: 2 x 2
   line text
  <int> <chr>
1     1 I am happy and kind of sad
2     2 sad is sad, happy is good
Then you can split each sentence into words, one word per row. unnest_tokens() does this for every sentence (every line ID) at once, and it also lowercases the words and strips punctuation:
tidy <- df %>% unnest_tokens(word, text)
> tidy
# A tibble: 13 x 2
    line word
   <int> <chr>
 1     1 i
 2     1 am
 3     1 happy
 4     1 and
 5     1 kind
 6     1 of
 7     1 sad
 8     2 sad
 9     2 is
10     2 sad
# ... with 3 more rows
Now your brand new lexicon:
lexicon <- tibble(word = c('happy', 'sad'), scores = c(1.3455, -1.0552))
> lexicon
# A tibble: 2 x 2
  word  scores
  <chr>  <dbl>
1 happy   1.35
2 sad    -1.06
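Since the real lexicon will be longer, it may be easier to read it from a file than to type it in. Here is a minimal sketch, assuming a two-column CSV with the headers word and scores. The inline text stands in for a hypothetical file, so you would pass your own file path to read.csv() instead. The words are lowercased because unnest_tokens() lowercases tokens by default, so mixed-case lexicon entries would otherwise fail to match.

```r
# Stand-in for a hypothetical lexicon file; the column names are assumptions.
lexicon_csv <- "word,scores
Happy,1.3455
sad,-1.0552"

# For a real file, use read.csv("path/to/your_lexicon.csv", ...) instead.
lexicon <- read.csv(text = lexicon_csv, stringsAsFactors = FALSE)

# Normalise the words so they match the lowercased tokens.
lexicon$word <- tolower(trimws(lexicon$word))
```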
Next, merge the lexicon with the tokenized data. merge() keeps only the words that appear in both, so words without a score are dropped:
merged <- merge(tidy,lexicon, by = 'word')
Now sum the scores for each phrase to get its sentiment:
scoredf <- aggregate(scores ~ line, data = merged, sum)
> scoredf
  line  scores
1    1  0.2903
2    2 -0.7649
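If you prefer to stay in the tidyverse, the merge and aggregate steps can probably be written as a single pipeline. This is only a sketch of an equivalent approach using dplyr's inner_join(), group_by() and summarise(). It rebuilds the same example data so that it runs on its own, and the name scores_by_line is just illustrative.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Same example data as above, rebuilt so the sketch is self-contained.
df <- tibble(line = 1:2,
             text = c("I am happy and kind of sad",
                      "sad is sad, happy is good"))
lexicon <- tibble(word = c("happy", "sad"),
                  scores = c(1.3455, -1.0552))

scores_by_line <- df %>%
  unnest_tokens(word, text) %>%        # one lowercased word per row
  inner_join(lexicon, by = "word") %>% # keep only words present in the lexicon
  group_by(line) %>%
  summarise(scores = sum(scores))      # total score per sentence
```

Like merge(), inner_join() drops every word that has no entry in the lexicon, so the sums are the same as in scoredf.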
Lastly, you can merge the initial df with the scores, to have phrases and scores together:
merge(df, scoredf, by = 'line')
  line                       text  scores
1    1 I am happy and kind of sad  0.2903
2    2  sad is sad, happy is good -0.7649
This gives you the overall sentiment score of every phrase when you have several of them.
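One thing to watch for: a phrase that contains no lexicon words never gets a score, so a plain merge() drops it from the final table. Below is a small sketch of one way to keep such phrases, assuming you want them to score 0. The third sentence is made up for illustration, and the scores are copied from the result above so that the snippet runs on its own.

```r
# Example data; the third sentence is hypothetical and has no lexicon words.
df <- data.frame(line = 1:3,
                 text = c("I am happy and kind of sad",
                          "sad is sad, happy is good",
                          "nothing to score here"),
                 stringsAsFactors = FALSE)

# Per-line scores as computed earlier (line 3 is absent: no matching words).
scoredf <- data.frame(line = 1:2, scores = c(0.2903, -0.7649))

# all.x = TRUE keeps every row of df; unmatched lines get NA.
out <- merge(df, scoredf, by = "line", all.x = TRUE)

# Treating "no lexicon words" as a neutral score of 0 is an assumption.
out$scores[is.na(out$scores)] <- 0
```

Whether 0 or NA is the better value for unscored phrases depends on how you use the scores later.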
Hope it helps!
Upvotes: 3