nesta1990

Reputation: 295

Text dictionary-based sentiment analysis (tidytext)

I think I have done all the steps necessary to prepare my textual data for dictionary-based sentiment analysis, but I am struggling to run the sentiment analysis itself. Specifically, I have removed unnecessary characters and stop words and applied stemming, but I am not sure how to run the sentiment analysis itself, as shown below.

#Loading packages
library(tidyverse)
library(textdata)
library(tidytext)
require(writexl)
library(quanteda)

Data example:

dput(df[1:5,c(1,2,3)])

output:

structure(list(id = 1:5, username = c("106gunner", "CPTMiller", 
"matey1982", "Why so serious", "Joe Maya"), post = c("Was reported in SCMP news source underneath link", 
"Government already said ft or CECA create new good jobs for Singaporean", 
"gunner said Was reported in SCMP news source underneath linkClick to expand arent u stating the obvious", 
"lightboxclose Close lightboxnext Next lightboxprevious Previous lightboxerror The requested content cannot be loaded Please try again later lightboxstartslideshow Start slideshow lightboxstopslideshow Stop slideshow lightboxfullscreen Full screen lightboxthumbnails Thumbnails lightboxdownload Download lightboxshare Share lightboxzoom Zoom lightboxnewwindow New window lightboxtogglesidebar Toggle sidebar", 
"From personal experience i lost my job to jhk")), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))
## Remove specific characters that add no value to the post.
strings_to_remove <- c("click","expand","Click","to", "can", "like", "also", "go", "just", "even", "now", "see", "got", "another", "dont", 
                       "know",">" ,"get","ones","team","didnt","first","mostly","old", "long", "time", "well", 
                       "going", "think", "still", "wanted", "instead", "times", "years", "high", "big", "thats", "using")

regex<-paste(paste0("(^|\\s+)", strings_to_remove, "\\.?", "(?=\\s+|$)"),collapse="|")

df_test <- corpus_all %>% 
  mutate(post = str_remove_all(post, regex))

df_test$post <- gsub("Click to expand", "", df_test$post)

#Converting dataframe into a corpus object
df<- corpus(df_test,
                        docid_field = "id",
                        text_field = "post")

#Loading list of colloquial stop words
stopwords <- c(stopwords("en", source = "marimo"))


#Obtaining a DTM removing punctuation, numbers, and stopwords
toks <- tokens(df, 
               remove_punct = TRUE, 
               remove_numbers = TRUE) %>% 
  tokens_remove(pattern = stopwords)

dtm_c <- dfm(toks)

#stopwords can be removed from other sources such as "misc"
#looking at the list it seems like marimo has more words

#Looking at number of features:
dtm_c

#Stemming to reduce multiple conjugations/forms of a word to its root
tab  <- dfm_wordstem(dtm_c, language = "en")

tab<- na.omit(tab)

head(tab)

I then ran the code below based on the solution here, but I am unable to resolve the error message that I receive:

"Error in UseMethod("inner_join") : no applicable method for 'inner_join' applied to an object of class "tokens"

#get the sentiment from the first text:
toks %>%
  inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
  dplyr::count(sentiment) %>% # count the # of positive & negative words
  spread(sentiment, n, fill = 0) %>% # make data wide rather than narrow
  mutate(sentiment = positive - negative) # # of positive words - # of negative words

Upvotes: 1

Views: 445

Answers (3)

Ken Benoit

Reputation: 14902

The quanteda package offers an alternative way to compute sentiment easily, through the quanteda.sentiment package. It can compute sentiment for either "polarity" dictionaries (lists of positive and negative words) or "valence" dictionaries (lists of words with numerical sentiment scores).

df <- structure(list(id = 1:5, username = c("106gunner", "CPTMiller", 
                                      "matey1982", "Why so serious", "Joe Maya"), post = c("Was reported in SCMP news source underneath link", 
                                                                                           "Government already said ft or CECA create new good jobs for Singaporean", 
                                                                                           "gunner said Was reported in SCMP news source underneath linkClick to expand arent u stating the obvious", 
                                                                                           "lightboxclose Close lightboxnext Next lightboxprevious Previous lightboxerror The requested content cannot be loaded Please try again later lightboxstartslideshow Start slideshow lightboxstopslideshow Stop slideshow lightboxfullscreen Full screen lightboxthumbnails Thumbnails lightboxdownload Download lightboxshare Share lightboxzoom Zoom lightboxnewwindow New window lightboxtogglesidebar Toggle sidebar", 
                                                                                           "From personal experience i lost my job to jhk")), row.names = c(NA, 
                                                                                                                                                            -5L), class = c("tbl_df", "tbl", "data.frame"))

library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
corp <- corpus(df, text_field = "post", docid_field = "id")
toks <- tokens(corp)

# remotes::install_github("quanteda/quanteda.sentiment")
library("quanteda.sentiment")
#> 
#> Attaching package: 'quanteda.sentiment'
#> The following object is masked from 'package:quanteda':
#> 
#>     data_dictionary_LSD2015

Computing sentiment is then just a matter of calling the functions and supplying one of the built-in dictionaries.

textstat_polarity(toks, dictionary = data_dictionary_HuLiu)
#>   doc_id sentiment
#> 1      1  0.000000
#> 2      2  1.098612
#> 3      3  0.000000
#> 4      4  0.000000
#> 5      5 -1.098612

textstat_valence(toks, dictionary = data_dictionary_AFINN)
#>   doc_id  sentiment
#> 1      1  0.0000000
#> 2      2  3.0000000
#> 3      3  1.0000000
#> 4      4  0.3333333
#> 5      5 -3.0000000
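
You can also score texts against a dictionary of your own. A minimal sketch, assuming the toks object created above (the word lists here are only illustrative, not a real sentiment lexicon):

```r
library(quanteda)
library(quanteda.sentiment)

# Define a custom polarity dictionary (illustrative word lists)
my_dict <- dictionary(list(
  positive = c("good", "new"),
  negative = c("lost")
))

# Declare which dictionary keys count as positive and which as negative
polarity(my_dict) <- list(pos = "positive", neg = "negative")

textstat_polarity(toks, dictionary = my_dict)
```

The same pattern works for valence dictionaries via the valence() setter instead of polarity().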

Created on 2023-09-07 with reprex v2.0.2

Upvotes: 1

andrew_reece

Reputation: 21274

The get_sentiments() function returns a tibble, so anything you want to join to it also needs to be a tibble or data frame. But toks is a special tokens object from the quanteda package: essentially a named list with extra attributes.

I would recommend just doing everything in either tidytext or quanteda, but if you need to mix them for some reason, call as.list() on toks, then use map() to look up the tokens for each username in your sentiment dictionary of choice.

as.list(toks) |> 
  set_names(df$username) |> 
  map(\(tok_list) tibble(word = tok_list) |> 
        left_join(get_sentiments('bing'))) |> 
  list_rbind(names_to = 'user') |>
  tail() # just to show some output

# A tibble: 6 × 3
  user     word  sentiment
  <chr>    <chr> <chr>    
1 Joe Maya i     NA       
2 Joe Maya lost  negative 
3 Joe Maya my    NA       
4 Joe Maya job   NA       
5 Joe Maya to    NA       
6 Joe Maya jhk   NA   

Note that get_sentiments() has its tokens in a column named word, so it's convenient to also have a word column in the dataset you want to join on.
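
If your token column has a different name, you can either rename it to word or spell out the join keys. A quick sketch (the tokens_df tibble here is just an illustration):

```r
library(dplyr)
library(tidytext)

tokens_df <- tibble::tibble(token = c("good", "lost", "jhk"))

# Option 1: rename the column to match get_sentiments()'s "word" column
tokens_df |>
  rename(word = token) |>
  left_join(get_sentiments("bing"), by = "word")

# Option 2: keep the name and specify the join keys explicitly
tokens_df |>
  left_join(get_sentiments("bing"), by = c("token" = "word"))
```

Both produce the same rows; unmatched tokens (like "jhk") get NA in the sentiment column.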

Upvotes: 1

Christopher Belanger

Reputation: 631

You're running into problems because you're using two great packages, tidytext and quanteda, that work in different ways and don't always work well together. tidytext works with regular data frames or tibbles, and quanteda works with custom data structures.

Here's a reproducible example that uses your data to find the sentiment-bearing words using tidytext and the bing sentiment dictionary. It has three steps:

  1. It unnests the column "post" into a new column called "word", with one row for each word.
  2. Then, it removes a set of stop words (junk words like "it", "a", and so on) using the stop_words dataset built into tidytext.
  3. Then it joins our column of remaining words with the bing sentiment dictionary.
library(tidytext)
library(dplyr)

df <- structure(
  list(
    id = 1:5, username = c(
      "106gunner", "CPTMiller", "matey1982", "Why so serious", "Joe Maya"), 
    post = c("Was reported in SCMP news source underneath link", 
             "Government already said ft or CECA create new good jobs for Singaporean", 
             "gunner said Was reported in SCMP news source underneath linkClick to expand arent u stating the obvious", 
             "lightboxclose Close lightboxnext Next lightboxprevious Previous lightboxerror The requested content cannot be loaded Please try again later lightboxstartslideshow Start slideshow lightboxstopslideshow Stop slideshow lightboxfullscreen Full screen lightboxthumbnails Thumbnails lightboxdownload Download lightboxshare Share lightboxzoom Zoom lightboxnewwindow New window lightboxtogglesidebar Toggle sidebar", 
             "From personal experience i lost my job to jhk")), 
  row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))


# Step 1: unnest the column "post" into a new column called "word", with one row
# for each word.
# Step 2: anti_join() with a set of stopwords (junk words we don't care about)
# Step 3: join our column of remaining words with the bing sentiment dictionary
df |>
  tidytext::unnest_tokens(output="word", input="post") |>
  dplyr::anti_join(tidytext::stop_words)|>
  dplyr::inner_join(tidytext::get_sentiments("bing"))

As a next step you could use dplyr::group_by() and dplyr::summarize() to count instances of positive or negative words, or look at other dictionaries that give numeric weightings instead of just positive/negative ratings.
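
For example, the counting step might look like this, sketched on a toy word/sentiment table with the same shape as the join output above (the usernames and counts are illustrative; results on your real data will differ):

```r
library(dplyr)
library(tidyr)

# Toy version of the joined output above (illustrative rows)
sentiment_words <- tibble::tibble(
  username  = c("A", "A", "B"),
  sentiment = c("positive", "negative", "negative")
)

sentiment_words |>
  count(username, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net_sentiment = positive - negative)
```

This yields one row per user, with positive and negative word counts and a simple net score.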

Please note also that your original example was not reproducible on my machine, because the variable corpus_all doesn't seem to be defined. You'll get better answers if you post reproducible examples.

Upvotes: 3
