When doing a sentiment analysis in R using dplyr, as described in this post, some of my rows go missing. I've provided a set of 6 Dutch sentences. As can be seen, rows 3 and 6 do not appear in the new df that includes the sentiment analysis.
I tried changing the "drop" to "keep", "drop" and "NULL". I also tried commenting out certain parts after the df %>% solution, but both without result.
Is someone able to explain this behavior to me? And how can I fix it?
library(tidyverse)
library(xml2)
library(tidytext)

# Example data set
text <- c("Slechte bediening, van begin tot eind",
          "Het eten was heerlijk en de bediening was fantastisch",
          "Geweldige service en beleefde bediening",
          "Verschrikkelijk. Ik had een vlieg in mijn soep",
          "Het was oké. De bediening kon wat beter, maar het eten was wel lekker. Leuk sfeertje wel!",
          "Ondanks dat het druk was toch op tijd ons eten gekregen. Complimenten aan de kok voor het op smaak brengen van mijn biefstuk")
identifier <- c("3", "4", "6", "7", "1", "5")
df <- data.frame(identifier, text)

# Sentiment analysis Dutch: read the lexicon and turn each <word> element into one row
sentiment_nl <- read_xml(
  "https://raw.githubusercontent.com/clips/pattern/master/pattern/text/nl/nl-sentiment.xml"
) %>%
  as_list() %>%
  .[[1]] %>%
  map_df(function(x) {
    tibble::enframe(attributes(x))
  }) %>%
  # every "form" attribute starts a new dictionary entry
  mutate(id = cumsum(name == "form")) %>%
  unnest(value) %>%
  pivot_wider(id_cols = id) %>%
  mutate(polarity = as.numeric(polarity),
         subjectivity = as.numeric(subjectivity),
         intensity = as.numeric(intensity),
         confidence = as.numeric(confidence))

# Tokenize, match tokens against the lexicon, and average per sentence
df <- df %>%
  mutate(identifier = identifier) %>%
  unnest_tokens(output = word, input = text, drop = FALSE) %>%
  inner_join(sentiment_nl, by = c("word" = "form")) %>%
  group_by(identifier) %>%
  summarise(text = head(text, 1),
            polarity = mean(polarity),
            subjectivity = mean(subjectivity),
            .groups = "drop")
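As pointed out in @Bas's comment, some word forms are missing from the dictionary. Because inner_join() keeps only tokens that have a match, a sentence with no matching tokens disappears from the result entirely. You can solve this by getting a better dictionary, stemming or lemmatization.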
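To see the mechanism in isolation, here is a minimal sketch with two of the sentences from the question and a made-up one-word dictionary (toy_dict and its polarity value are hypothetical, not taken from nl-sentiment.xml). It compares inner_join() with left_join(); the latter keeps sentences without any match, at the cost of a missing score.

```r
library(dplyr)
library(tidytext)

# Two sentences from the question
df <- data.frame(
  identifier = c("3", "6"),
  text = c("Slechte bediening, van begin tot eind",
           "Geweldige service en beleefde bediening"),
  stringsAsFactors = FALSE
)

# Hypothetical mini-dictionary: only "slechte" is known (value is made up)
toy_dict <- tibble::tibble(form = "slechte", polarity = -0.7)

tokens <- df %>%
  unnest_tokens(output = word, input = text, drop = FALSE)

# inner_join drops every token without a match, so sentence "6" vanishes
with_inner <- tokens %>%
  inner_join(toy_dict, by = c("word" = "form")) %>%
  group_by(identifier) %>%
  summarise(polarity = mean(polarity), .groups = "drop")

# left_join keeps unmatched tokens with NA polarity, so every sentence survives;
# a sentence with no matches at all ends up with NaN as its mean
with_left <- tokens %>%
  left_join(toy_dict, by = c("word" = "form")) %>%
  group_by(identifier) %>%
  summarise(text = first(text),
            polarity = mean(polarity, na.rm = TRUE),
            .groups = "drop")

nrow(with_inner)  # one sentence left
nrow(with_left)   # both sentences kept
```

Whether a missing score is preferable to a missing row depends on what you do with the result afterwards; it does not fix the underlying gap in the dictionary.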
Ideally, you would use a lemmatizer, which is superior to stemming. However, I think a stemmer works fine for the example you've given. So you can use this to construct the dictionary:
library(tidyverse)
library(xml2)
library(tidytext)
library(textstem)

sentiment_nl <- read_xml(
  "https://raw.githubusercontent.com/clips/pattern/master/pattern/text/nl/nl-sentiment.xml"
) %>%
  as_list() %>%
  .[[1]] %>%
  map_df(function(x) {
    tibble::enframe(attributes(x))
  }) %>%
  # every "form" attribute starts a new dictionary entry
  mutate(id = cumsum(name == "form")) %>%
  unnest(value) %>%
  pivot_wider(id_cols = id) %>%
  mutate(form = tolower(form),
         stem = textstem::stem_words(form), # this is the new line
         polarity = as.numeric(polarity),
         subjectivity = as.numeric(subjectivity),
         intensity = as.numeric(intensity),
         confidence = as.numeric(confidence))
And then also stem the words in the text before matching on the stems:
df %>%
  unnest_tokens(output = word, input = text, drop = FALSE) %>%
  mutate(stem = textstem::stem_words(word)) %>%
  inner_join(sentiment_nl, by = "stem") %>%
  group_by(identifier) %>%
  summarise(text = head(text, 1),
            polarity = mean(polarity),
            subjectivity = mean(subjectivity),
            .groups = "drop")
#> # A tibble: 6 x 4
#>   identifier text                                          polarity subjectivity
#>   <chr>      <chr>                                            <dbl>        <dbl>
#> 1 1          Het was oké. De bediening kon wat beter, maa…    0.6          0.98
#> 2 3          Slechte bediening, van begin tot eind           -0.7          0.9
#> 3 4          Het eten was heerlijk en de bediening was fa…    0.56         0.72
#> 4 5          Ondanks dat het druk was toch op tijd ons et…   -0.233        0.767
#> 5 6          Geweldige service en beleefde bediening          0.7          0.95
#> 6 7          Verschrikkelijk. Ik had een vlieg in mijn so…   -0.3          0.733
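One thing you may want to check: as far as I know, textstem::stem_words() wraps SnowballC::wordStem() and defaults to the English Porter stemmer, which happens to be good enough for these sentences. SnowballC also ships a Dutch stemmer. The sketch below only compares the two on a few words from the question so you can inspect the difference yourself; it is not a claim that the Dutch stemmer gives better sentiment scores here.

```r
library(SnowballC)

# A few inflected forms from the example sentences
words <- c("geweldige", "beleefde", "bediening")

# Languages supported by the installed SnowballC; "dutch" should be listed
has_dutch <- "dutch" %in% getStemLanguages()

# Same words, stemmed with the English Porter rules and with the Dutch rules
stems_porter <- wordStem(words, language = "porter")
stems_dutch <- wordStem(words, language = "dutch")

# Side-by-side comparison for manual inspection
comparison <- data.frame(word = words,
                         porter = stems_porter,
                         dutch = stems_dutch,
                         stringsAsFactors = FALSE)
print(comparison)
```

If you switch stemmers, use the same one for the dictionary and for the text; otherwise the stems will not line up in the join.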