Mark

Reputation: 1769

Count positive and negative words in a string considering negatives and negation

The following code matches positive and negative words in a text and counts them. Consider, for example:

sentences<-c("You are not perfect!", 
            "However, let us not forget what happened across the Atlantic.", 
            "And I can't support you.",
            "No abnormal energy readings",
            "So with gratitude, the universe is abundant forever.")

We first import positive and negative words

pos = readLines("positive-words.txt")
neg = readLines("negative-words.txt")

from txt files. These files contain:

abundant
gratitude
perfect
support

for positive-words.txt and

abnormal

for negative-words.txt. The following commands:

sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub('\\d+', '', sentence)

remove punctuation, control characters and digits. Then we split the sentence into words with str_split (stringr package)

word.list = str_split(sentence, "\\s+")
words = unlist(word.list)

and compare words to the dictionaries of positive & negative terms

pos.matches = match(words, pos)
neg.matches = match(words, neg)
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
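For reference, the steps above can be collected into one small function. This is only a sketch: match_words is a hypothetical helper name, the two dictionaries are written inline as stand-ins for the txt files, and base R's strsplit is used in place of str_split so that the snippet runs without stringr.

```r
# Inline stand-ins for positive-words.txt and negative-words.txt (assumption)
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")

# Hypothetical helper: clean one sentence and flag the words found in a dictionary
match_words <- function(sentence, dictionary) {
  sentence <- gsub("[[:punct:]]", "", sentence)  # drop punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)  # drop control characters
  sentence <- gsub("\\d+", "", sentence)         # drop digits
  words <- unlist(strsplit(sentence, "\\s+"))    # base R stand-in for str_split
  words %in% dictionary                          # same as !is.na(match(words, dictionary))
}

pos_flags <- match_words("So with gratitude, the universe is abundant forever.", pos)
neg_flags <- match_words("No abnormal energy readings", neg)
```

sum(pos_flags) and sum(neg_flags) then give the word counts for a sentence.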

The variable sentence can be sentences[1], sentences[2], sentences[3], sentences[4] or sentences[5]. E.g. if sentence=sentences[5], this code correctly returns two positive words; in fact the result is:

> pos.matches
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

The same happens for all the other sentences. E.g. if sentence=sentences[4]:

> neg.matches
[1] FALSE  TRUE FALSE FALSE

However, I would like to modify this code to handle the cases in sentences[1], sentences[3] and sentences[4]: perfect in sentences[1] is a positive word, but it is preceded by not, so I would like to treat the two words as one negative term; support in sentences[3] is a positive word, but it is preceded by cant, so I would like to treat the two words as one negative term; abnormal in sentences[4] is a negative word, but it is preceded by no, so I would like to treat the two words as one positive term. E.g. the desired result for sentence=sentences[4] is:

> pos.matches
[1] TRUE FALSE FALSE

Instead, with this code I obtain:

> pos.matches
[1] FALSE FALSE FALSE FALSE

I thought of defining a variable with negatives and negations:

NegativesNegations <- paste0("\\b(", paste(c("no","not","couldnt","cant"), collapse = "|"), ")\\b")

But I don't know how to move forward with this.

Upvotes: 1

Views: 377

Answers (1)

MarkusN

Reputation: 3223

You can accomplish this task with plain regex. First, turn your positive and negative word lists into regex strings, as you did with the list of negatives and negations:

pos_rgx = paste0("\\b(", paste(pos, collapse="|"), ")\\b")
neg_rgx = paste0("\\b(", paste(neg, collapse="|"), ")\\b")

You can now check, for every sentence, whether a positive or negative word occurs:

grepl(pos_rgx, sentences, ignore.case=TRUE)
grepl(neg_rgx, sentences, ignore.case=TRUE)
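As a quick check, here is what these two calls give for the example sentences from the question. The word lists are written inline as a stand-in for the two txt files, and has_pos and has_neg are names chosen just for this sketch.

```r
sentences <- c("You are not perfect!",
               "However, let us not forget what happened across the Atlantic.",
               "And I can't support you.",
               "No abnormal energy readings",
               "So with gratitude, the universe is abundant forever.")

# Inline stand-ins for positive-words.txt and negative-words.txt (assumption)
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")

pos_rgx <- paste0("\\b(", paste(pos, collapse = "|"), ")\\b")
neg_rgx <- paste0("\\b(", paste(neg, collapse = "|"), ")\\b")

# One logical value per sentence
has_pos <- grepl(pos_rgx, sentences, ignore.case = TRUE)
has_neg <- grepl(neg_rgx, sentences, ignore.case = TRUE)

print(has_pos)  # TRUE FALSE TRUE FALSE TRUE
print(has_neg)  # FALSE FALSE FALSE TRUE FALSE
```

Note that grepl only says whether a sentence contains at least one match; it does not count the matches.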

To add the negations, proceed in the same way:

pos_neg_rgx = paste0('\\b(no|not|couldn\'t|can\'t)\\s', pos_rgx)
grepl(pos_neg_rgx, sentences, ignore.case=TRUE)

note: '\\s' means that exactly one whitespace character must sit between the negation and the positive word

note(2): if you quote your string with single quotes, you have to escape the quote in "can't" (as in the example). Alternatively, quote the string with double quotes, where the apostrophe needs no escaping but the backslashes must still be doubled: "\\b(no|not|couldn't|can't)\\s"
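Since the question asks for counts, the same patterns could also be combined into per-sentence totals. The following is only a sketch under a few assumptions: count_matches is a hypothetical helper built on gregexpr and regmatches, the word lists are inline stand-ins for the txt files, '\\s+' is used instead of '\\s' to tolerate repeated spaces, and a negated positive word is simply moved to the negative tally (and vice versa).

```r
sentences <- c("You are not perfect!",
               "However, let us not forget what happened across the Atlantic.",
               "And I can't support you.",
               "No abnormal energy readings",
               "So with gratitude, the universe is abundant forever.")

# Inline stand-ins for positive-words.txt and negative-words.txt (assumption)
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")

pos_rgx <- paste0("\\b(", paste(pos, collapse = "|"), ")\\b")
neg_rgx <- paste0("\\b(", paste(neg, collapse = "|"), ")\\b")

# Negation followed by whitespace, then a dictionary word
negation    <- "\\b(no|not|couldn't|can't)\\s+"
neg_pos_rgx <- paste0(negation, "(", paste(pos, collapse = "|"), ")\\b")
neg_neg_rgx <- paste0(negation, "(", paste(neg, collapse = "|"), ")\\b")

# Hypothetical helper: number of regex matches in each element of x
count_matches <- function(rgx, x) {
  lengths(regmatches(x, gregexpr(rgx, x, ignore.case = TRUE)))
}

negated_pos <- count_matches(neg_pos_rgx, sentences)  # e.g. "not perfect"
negated_neg <- count_matches(neg_neg_rgx, sentences)  # e.g. "No abnormal"

# Move negated words to the opposite tally
pos_count <- count_matches(pos_rgx, sentences) - negated_pos + negated_neg
neg_count <- count_matches(neg_rgx, sentences) - negated_neg + negated_pos

print(pos_count)  # 0 0 0 1 2
print(neg_count)  # 1 0 1 0 0
```

With this, sentences[1] and sentences[3] each count one negative term, sentences[4] counts one positive term, and sentences[5] still counts two positive words.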

If you'd like to dig deeper into text mining, have a look at the tidytext package.

Upvotes: 1
