Reputation: 65
I can't comment on this page where i found a function Sentiment Analysis Text Analytics in Russian / Cyrillic languages
get_sentiment_rus <- function(char_v, method="custom", lexicon=NULL, path_to_tagger = NULL, cl = NULL, language = "english") {
language <- tolower(language)
russ.char.yes <- "[\u0401\u0410-\u044F\u0451]"
russ.char.no <- "[^\u0401\u0410-\u044F\u0451]"
if (is.na(pmatch(method, c("syuzhet", "afinn", "bing", "nrc",
"stanford", "custom"))))
stop("Invalid Method")
if (!is.character(char_v))
stop("Data must be a character vector.")
if (!is.null(cl) && !inherits(cl, "cluster"))
stop("Invalid Cluster")
if (method == "syuzhet") {
char_v <- gsub("-", "", char_v)
}
if (method == "afinn" || method == "bing" || method == "syuzhet") {
word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
if (is.null(cl)) {
result <- unlist(lapply(word_l, get_sent_values,
method))
}
else {
result <- unlist(parallel::parLapply(cl = cl, word_l,
get_sent_values, method))
}
}
else if (method == "nrc") {
# word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
lexicon <- dplyr::filter_(syuzhet:::nrc, ~lang == tolower(language),
~sentiment %in% c("positive", "negative"))
lexicon[which(lexicon$sentiment == "negative"), "value"] <- -1
result <- unlist(lapply(word_l, get_sent_values, method,
lexicon))
}
else if (method == "custom") {
# word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
result <- unlist(lapply(word_l, get_sent_values, method,
lexicon))
}
else if (method == "stanford") {
if (is.null(path_to_tagger))
stop("You must include a path to your installation of the coreNLP package. See http://nlp.stanford.edu/software/corenlp.shtml")
result <- get_stanford_sentiment(char_v, path_to_tagger)
}
return(result)
}
It gives an error
> mysentiment <- get_sentiment_rus(as.character(corpus))
Show Traceback
Rerun with Debug
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "NULL"
And the sentiment scores are equal to 0
> SentimentScores <- data.frame(colSums(mysentiment[,]))
> SentimentScores
colSums.mysentiment.....
anger 0
anticipation 0
disgust 0
fear 0
joy 0
sadness 0
surprise 0
trust 0
negative 0
positive 0
Could you please point out where a problem might be? Or suggest any other working method for sentiment analysis в R? Just wonder what package supports russian language.
I am looking for any working method for sentiment analysis of a text in russian.
Upvotes: 2
Views: 1226
Reputation: 12420
It looks to me like your function did not really find any sentiment words in your text. This might have to do with the sentiment dictionary you are using. Instead of trying to repair this function, you might want to consider a tidy approach instead, which is outlined in the book "Text Mining with R. A Tidy Approach". The advantage is that it does not mind the cyrillic letters and that it is really easy to understand and tweak.
First, we need a dictionary with sentiment values. I found one on GitHub, which we can directly read into R:
library(rvest)
library(stringr)
library(tidytext)
library(dplyr)
dict <- readr::read_csv("https://raw.githubusercontent.com/text-machine-lab/sentimental/master/sentimental/word_list/russian.csv")
Next, let's get some test data to work with. For no particular reason, I use the Russian Wikipedia entry for Brexit and scrape the text:
brexit <- "https://ru.wikipedia.org/wiki/%D0%92%D1%8B%D1%85%D0%BE%D0%B4_%D0%92%D0%B5%D0%BB%D0%B8%D0%BA%D0%BE%D0%B1%D1%80%D0%B8%D1%82%D0%B0%D0%BD%D0%B8%D0%B8_%D0%B8%D0%B7_%D0%95%D0%B2%D1%80%D0%BE%D0%BF%D0%B5%D0%B9%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D1%81%D0%BE%D1%8E%D0%B7%D0%B0" %>%
read_html() %>%
html_nodes("body") %>%
html_text() %>%
tibble(text = .)
Now this data can be turned into a tidy format. I split the text into paragraphs first, so we can check sentiment scores for paragraphs individually.
brexit_tidy <- brexit %>%
unnest_tokens(output = "paragraph", input = "text", token = "paragraphs") %>%
mutate(id = seq_along(paragraph)) %>%
unnest_tokens(output = "word", input = "paragraph", token = "words")
The way a dictionary is used with tidy data is incredibly straightwoard from this point. You just combine the data frame with sentiment values (i.e., the dictionary) and the data frame with the words in your text. Where text and dictionary match, the sentiment value is added. All other values are dropped.
# apply dictionary
brexit_sentiment <- brexit_tidy %>%
inner_join(dict, by = "word")
head(brexit_sentiment)
#> # A tibble: 6 x 3
#> id word score
#> <int> <chr> <dbl>
#> 1 7 затяжной -1.7
#> 2 13 против -5
#> 3 22 популярность 5
#> 4 22 против -5
#> 5 23 нужно 1.7
#> 6 39 против -5
Instead of the value for each word, you probably prefer the values per paragraphs. This can easily be done by getting the mean for each paragraph:
# group sentiment by paragraph
brexit_sentiment %>%
group_by(id) %>%
summarise(sentiment = mean(score))
#> # A tibble: 25 x 2
#> id sentiment
#> <int> <dbl>
#> 1 7 -1.7
#> 2 13 -5
#> 3 22 0
#> 4 23 1.7
#> 5 39 -5
#> 6 42 5
#> 7 43 -1.88
#> 8 44 -3.32
#> 9 45 -3.35
#> 10 47 1.7
#> # … with 15 more rows
There are a couple of ways this approach could be improved if necessary:
Upvotes: 7