gaurav v

Reputation: 63

Sentiment analysis for tidytext in R

I am trying to perform sentiment analysis in R. I want to use either the AFINN or the Bing lexicon, but the problem is that I can't tokenize the phrases correctly.

Here are the words I need sentiments for:

sentiment_words

So there are six values I want sentiments for: Pass, Fail, Not Ready, Out of Business, Pass w/conditions, No entry.

How do I use any of the lexicons to assign sentiments to these words?

Here is my code:

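library(dplyr)
library(tidytext)

# Keep the first 2000 results in a one-column data frame
d <- as.data.frame(data$Results)
d <- as.data.frame(d[1:2000, ])

colnames(d) <- "text"

# Make a preprocessed data frame from the cleaned corpus
# (tweet_corpus_clean is not defined in this snippet)
preprocess <- data.frame(text = sapply(tweet_corpus_clean, identity),
                         stringsAsFactors = FALSE)

# Tokenize: one word per row
tokens <- tibble(text = preprocess$text) %>% unnest_tokens(word, text)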

When I run this I get:

senti_new

Because the lexicons need one token per row to assign sentiments, I had to merge the words of each phrase together. Now when I use AFINN, it obviously cannot recognize a token such as outofbusiness.

tokens <- tibble(text = preprocess$text) %>% unnest_tokens(word, text)

# Note: recent tidytext/textdata releases name the AFINN column `value`, not `score`
contributions <- tokens %>%
  count(word) %>%                                       # n = occurrences of each word
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(word) %>%
  summarize(score = as.numeric(sum(score * n) / sum(n))) %>%
  arrange(desc(score))

How do I do sentiment analysis for those six types of words?

Upvotes: 0

Views: 625

Answers (1)

Julia Silge

Reputation: 11613

Hmmmm, this doesn't sound like a sentiment analysis problem to me. You have six words/phrases that you know about exactly, and you know what they mean in your context. This sounds like you just want to assign these words/phrases scores, or even just levels of a factor.
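For the factor option, a minimal sketch in base R might look like the following. The sample vector `results` and the ordering in `result_levels` are assumptions for illustration only; pick whatever ordering makes sense in your domain.

```r
# Hypothetical sample of results, using the six labels from the question
results <- c("Pass", "Fail", "Not Ready", "Out of Business",
             "Pass w/conditions", "No entry", "Fail")

# Assumed ordering of the levels, from worst to best outcome
result_levels <- c("Fail", "Out of Business", "Not Ready", "No entry",
                   "Pass w/conditions", "Pass")

# Convert the text to a factor with explicit levels
results_factor <- factor(results, levels = result_levels)

# Count how often each outcome occurs
outcome_counts <- table(results_factor)
outcome_counts
```

Any label that is not in `result_levels` would become `NA`, which is a quick way to spot unexpected values.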

You could do something like the following, where you as the analyst decide what score each of your phrases should have. Here, scores is the data frame that you construct with sensibly chosen scores for each text option, and df is the data you are analyzing.

library(dplyr)

scores <- tibble(text = c("pass",
                          "fail",
                          "not ready",
                          "out of business",
                          "pass w/conditions",
                          "no entry"),
                 score = c(3, -1, 0, 0, 2, 1))

scores
#> # A tibble: 6 x 2
#>   text              score
#>   <chr>             <dbl>
#> 1 pass               3.00
#> 2 fail              -1.00
#> 3 not ready          0   
#> 4 out of business    0   
#> 5 pass w/conditions  2.00
#> 6 no entry           1.00

df <- tibble(text = c("pass",
                      "pass",
                      "fail",
                      "not ready",
                      "out of business",
                      "no entry",
                      "fail",
                      "pass w/conditions",
                      "fail",
                      "no entry",
                      "pass w/conditions"))

df %>%
  left_join(scores)
#> Joining, by = "text"
#> # A tibble: 11 x 2
#>    text              score
#>    <chr>             <dbl>
#>  1 pass               3.00
#>  2 pass               3.00
#>  3 fail              -1.00
#>  4 not ready          0   
#>  5 out of business    0   
#>  6 no entry           1.00
#>  7 fail              -1.00
#>  8 pass w/conditions  2.00
#>  9 fail              -1.00
#> 10 no entry           1.00
#> 11 pass w/conditions  2.00
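
The labels in the question are capitalized ("Pass", "Out of Business"), while the lookup table here is lowercase, so an exact join on the raw values would likely give NA scores. A minimal sketch of normalizing the text before joining, assuming a hypothetical data frame `raw` with a `Results` column (the sample values, including the stray whitespace and the "Unknown" label, are made up for illustration):

```r
library(dplyr)

# Lookup table with the same illustrative scores as above
scores <- tibble(text = c("pass", "fail", "not ready", "out of business",
                          "pass w/conditions", "no entry"),
                 score = c(3, -1, 0, 0, 2, 1))

# Hypothetical raw values with mixed case and stray whitespace
raw <- tibble(Results = c("Pass", "Out of Business ", "Pass w/Conditions",
                          " No entry", "Unknown"))

scored <- raw %>%
  mutate(text = tolower(trimws(Results))) %>%  # normalize before joining
  left_join(scores, by = "text")

# Labels missing from the lookup table come back with an NA score
unmatched <- scored %>% filter(is.na(score))
```

Checking `unmatched` after the join is a cheap way to catch labels that the lookup table does not cover.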

Sentiment analysis is most appropriate where you have large amounts of unstructured text that you need to extract insight from. Here you have only six text elements, and you can use what you know about your domain and context to assign scores.

Upvotes: 1
