Reputation: 295
I think I have done all the steps necessary to prepare my textual data for dictionary-based sentiment analysis, but I am struggling to run the sentiment analysis itself. Specifically, I have removed unnecessary characters and stop words and applied stemming, as shown below, but I am not sure how to run the sentiment analysis itself.
#Loading packages
library(tidyverse)
library(textdata)
library(tidytext)
require(writexl)
library(quanteda)
Data example:
dput(df[1:5,c(1,2,3)])
Output:
structure(list(id = 1:5, username = c("106gunner", "CPTMiller",
"matey1982", "Why so serious", "Joe Maya"), post = c("Was reported in SCMP news source underneath link",
"Government already said ft or CECA create new good jobs for Singaporean",
"gunner said Was reported in SCMP news source underneath linkClick to expand arent u stating the obvious",
"lightboxclose Close lightboxnext Next lightboxprevious Previous lightboxerror The requested content cannot be loaded Please try again later lightboxstartslideshow Start slideshow lightboxstopslideshow Stop slideshow lightboxfullscreen Full screen lightboxthumbnails Thumbnails lightboxdownload Download lightboxshare Share lightboxzoom Zoom lightboxnewwindow New window lightboxtogglesidebar Toggle sidebar",
"From personal experience i lost my job to jhk")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
## Remove specific characters that add no value to the post.
strings_to_remove <- c("click","expand","Click","to", "can", "like", "also", "go", "just", "even", "now", "see", "got", "another", "dont",
"know",">" ,"get","ones","team","didnt","first","mostly","old", "long", "time", "well",
"going", "think", "still", "wanted", "instead", "times", "years", "high", "big", "thats", "using")
regex <- paste(paste0("(^|\\s+)", strings_to_remove, "\\.?", "(?=\\s+|$)"), collapse = "|")
df_test <- corpus_all %>%
mutate(post = str_remove_all(post, regex))
df_test$post <- gsub("Click to expand", "", df_test$post)
#Converting dataframe into a corpus object
df <- corpus(df_test,
docid_field = "id",
text_field = "post")
#Loading list of colloquial stop words
stopwords <- c(stopwords("en", source = "marimo"))
#Obtaining a DTM removing punctuation, numbers, and stopwords
toks <- tokens(df,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_remove(pattern = stopwords)
dtm_c <- dfm(toks)
#stopwords can be removed from other sources such as "misc"
#looking at the list it seems like marimo has more words
#Looking at number of features:
dtm_c
#Stemming to reduce multiple conjugations/forms of a word to its root
tab <- dfm_wordstem(dtm_c, language = "en")
tab<- na.omit(tab)
head(tab)
I then ran the code below, based on the solution here, but I am unable to resolve the error message that I receive:
"Error in UseMethod("inner_join") : no applicable method for 'inner_join' applied to an object of class "tokens"
#get the sentiment from the first text:
toks %>%
inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
dplyr::count(sentiment) %>% # count the # of positive & negative words
spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
mutate(sentiment = positive - negative) # # of positive words - # of negative words
Upvotes: 1
Views: 445
Reputation: 14902
The quanteda package offers an alternative way to compute sentiment easily, through the quanteda.sentiment package. It can compute sentiment from either "polarity" dictionaries (lists of positive and negative words) or "valence" dictionaries (lists of words with numerical sentiment scores).
df <- structure(list(id = 1:5, username = c("106gunner", "CPTMiller",
"matey1982", "Why so serious", "Joe Maya"), post = c("Was reported in SCMP news source underneath link",
"Government already said ft or CECA create new good jobs for Singaporean",
"gunner said Was reported in SCMP news source underneath linkClick to expand arent u stating the obvious",
"lightboxclose Close lightboxnext Next lightboxprevious Previous lightboxerror The requested content cannot be loaded Please try again later lightboxstartslideshow Start slideshow lightboxstopslideshow Stop slideshow lightboxfullscreen Full screen lightboxthumbnails Thumbnails lightboxdownload Download lightboxshare Share lightboxzoom Zoom lightboxnewwindow New window lightboxtogglesidebar Toggle sidebar",
"From personal experience i lost my job to jhk")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
corp <- corpus(df, text_field = "post", docid_field = "id")
toks <- tokens(corp)
# remotes::install_github("quanteda/quanteda.sentiment")
library("quanteda.sentiment")
#>
#> Attaching package: 'quanteda.sentiment'
#> The following object is masked from 'package:quanteda':
#>
#> data_dictionary_LSD2015
Computing sentiment is then just a matter of calling the functions and supplying one of the built-in dictionaries.
textstat_polarity(toks, dictionary = data_dictionary_HuLiu)
#> doc_id sentiment
#> 1 1 0.000000
#> 2 2 1.098612
#> 3 3 0.000000
#> 4 4 0.000000
#> 5 5 -1.098612
textstat_valence(toks, dictionary = data_dictionary_AFINN)
#> doc_id sentiment
#> 1 1 0.0000000
#> 2 2 3.0000000
#> 3 3 1.0000000
#> 4 4 0.3333333
#> 5 5 -3.0000000
Created on 2023-09-07 with reprex v2.0.2
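Both functions also accept custom dictionaries. As a minimal sketch (the word lists below are made up for illustration, not a real lexicon), you can build a quanteda dictionary and declare which keys are the positive and negative poles with polarity():
# a toy two-key dictionary; swap in a real lexicon for actual analysis
dict <- dictionary(list(positive = c("good", "new"),
                        negative = c("lost", "error")))
polarity(dict) <- list(pos = "positive", neg = "negative")
textstat_polarity(toks, dictionary = dict)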
Upvotes: 1
Reputation: 21274
The get_sentiments() function returns a tibble, so anything you want to join to it also needs to be a tibble or data frame. But toks is a special tokens object from the quanteda package - it is a kind of named list with extra add-ons.
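You can confirm the mismatch directly (output shown as comments):
class(toks)
#> [1] "tokens"
class(get_sentiments("bing"))
#> [1] "tbl_df"     "tbl"        "data.frame"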
I would recommend just doing everything in either tidytext or quanteda, but if you need to mix them for some reason, use as.list() on toks, then use map() to look up the tokens for each username in your sentiment dictionary of choice.
library(purrr)
library(dplyr)
library(tibble)
library(tidytext)
as.list(toks) |>
  set_names(df$username) |>
  map(\(tok_list) tibble(word = tok_list) |>
        left_join(get_sentiments('bing'))) |>
  list_rbind(names_to = 'user') |>
  tail() # just to show some output
# A tibble: 6 × 3
user word sentiment
<chr> <chr> <chr>
1 Joe Maya i NA
2 Joe Maya lost negative
3 Joe Maya my NA
4 Joe Maya job NA
5 Joe Maya to NA
6 Joe Maya jhk NA
Note that get_sentiments() has its tokens in a column named word, so it's convenient to also have a word column in the dataset you want to join on.
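If your tokens live under a different column name, you can map the join keys instead of renaming; tokens_df and token are hypothetical names here:
# join a hypothetical tokens_df whose word column is called "token"
tokens_df |>
  left_join(get_sentiments("bing"), by = c("token" = "word"))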
Upvotes: 1
Reputation: 631
You're running into problems because you're using two great packages, tidytext and quanteda, that work in different ways and don't always work well together. tidytext works with regular data frames or tibbles, while quanteda works with custom data structures.
Here's a reproducible example that uses your data to find the sentiment-bearing words using tidytext and the bing sentiment dictionary. It has three steps:
1. Unnest the "post" column into one word per row with tidytext.
2. Remove stopwords with anti_join().
3. Join the remaining words with the bing sentiment dictionary.
library(tidytext)
library(dplyr)
df <- structure(
list(
id = 1:5, username = c(
"106gunner", "CPTMiller", "matey1982", "Why so serious", "Joe Maya"),
post = c("Was reported in SCMP news source underneath link",
"Government already said ft or CECA create new good jobs for Singaporean",
"gunner said Was reported in SCMP news source underneath linkClick to expand arent u stating the obvious",
"lightboxclose Close lightboxnext Next lightboxprevious Previous lightboxerror The requested content cannot be loaded Please try again later lightboxstartslideshow Start slideshow lightboxstopslideshow Stop slideshow lightboxfullscreen Full screen lightboxthumbnails Thumbnails lightboxdownload Download lightboxshare Share lightboxzoom Zoom lightboxnewwindow New window lightboxtogglesidebar Toggle sidebar",
"From personal experience i lost my job to jhk")),
row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
# Step 1: unnest the column "post" into a new column called "word", with one row
# for each word.
# Step 2: anti_join() with a set of stopwords (junk words we don't care about)
# Step 3: join our column of remaining words with the bing sentiment dictionary
df |>
tidytext::unnest_tokens(output="word", input="post") |>
dplyr::anti_join(tidytext::stop_words)|>
dplyr::inner_join(tidytext::get_sentiments("bing"))
As a next step you could use dplyr::group_by() and dplyr::summarize() to count instances of positive or negative words, or look at other dictionaries that give numeric weightings instead of just positive/negative labels.
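A minimal sketch of that counting step, assuming the pipeline above is saved under the illustrative name sentiment_words:
# store the sentiment-bearing words from the pipeline above
sentiment_words <- df |>
  tidytext::unnest_tokens(output = "word", input = "post") |>
  dplyr::anti_join(tidytext::stop_words) |>
  dplyr::inner_join(tidytext::get_sentiments("bing"))
# count positive and negative words per post
sentiment_words |>
  dplyr::group_by(id, username, sentiment) |>
  dplyr::summarize(n = dplyr::n(), .groups = "drop")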
Please note also that your original example was not reproducible on my machine, because the variable corpus_all
doesn't seem to be defined. You'll get better answers if you post reproducible examples.
Upvotes: 3