Reputation: 1769
The following code matches positive and negative words in a text and counts them. Consider, for example:
sentences<-c("You are not perfect!",
"However, let us not forget what happened across the Atlantic.",
"And I can't support you.",
"No abnormal energy readings",
"So with gratitude, the universe is abundant forever.")
We first import positive and negative words
pos = readLines("positive-words.txt")
neg = readLines("negative-words.txt")
from text files. positive-words.txt contains:
abundant
gratitude
perfect
support
and negative-words.txt contains:
abnormal
The following commands:
sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub('\\d+', '', sentence)
remove punctuation, control characters and digits. Then we split the sentence into words with str_split (stringr package)
word.list = str_split(sentence, "\\s+")
words = unlist(word.list)
and compare words to the dictionaries of positive & negative terms
pos.matches = match(words, pos)
neg.matches = match(words, neg)
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
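For reference, the steps above can be collected into one function. This is only a minimal sketch: the name match_words is hypothetical, the word lists are the small samples quoted above rather than the full files, and base R's strsplit and %in% stand in for str_split and match so that no package is needed.

```r
# Sample dictionaries taken from the excerpts quoted in the question
# (assumption: the real files contain many more words).
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")

# Hypothetical helper: cleans one sentence and flags dictionary hits per word.
match_words <- function(sentence, pos, neg) {
  sentence <- gsub("[[:punct:]]", "", sentence)  # drop punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)  # drop control characters
  sentence <- gsub("\\d+", "", sentence)         # drop digits
  words <- unlist(strsplit(sentence, "\\s+"))    # split on whitespace
  list(words = words,
       pos.matches = words %in% pos,   # same as !is.na(match(words, pos))
       neg.matches = words %in% neg)
}

res <- match_words("So with gratitude, the universe is abundant forever.", pos, neg)
res$pos.matches
```

Like the original code, this sketch is case-sensitive and looks at each word in isolation, so a preceding negation is ignored.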
The variable sentence can be any of sentences[1] to sentences[5]. E.g. if sentence=sentences[5], this code correctly returns two positive words; the result is:
> pos.matches
[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
The code behaves the same way for all the other sentences. E.g. if sentence=sentences[4]:
> neg.matches
[1] FALSE TRUE FALSE FALSE
However, I would like to modify this code to handle the negations in sentences[1], sentences[3] and sentences[4]. In sentences[1], perfect is a positive word, but it is preceded by not, so I would like to treat these two words as one negative term. In sentences[3], support is a positive word, but it is preceded by cant, so I would again like to treat the two words as one negative term. In sentences[4], abnormal is a negative word, but it is preceded by no, so I would like to treat the two words as one positive term. E.g. the desired result for sentence=sentences[4] is:
> pos.matches
[1] TRUE FALSE FALSE
Instead, with this code I obtain:
> pos.matches
[1] FALSE FALSE FALSE FALSE
I thought of defining a variable that holds the negations as a regex (using paste0 so that no spaces end up inside the pattern):
NegativesNegations <- paste0("\\b(", paste(c("no","not","couldnt","cant"), collapse = "|"), ")\\b")
But I don't know how to move forward with this.
Upvotes: 1
Views: 377
Reputation: 3223
You can accomplish this task with plain regex. First, transform your positive and negative word lists into regex strings, as you did with the list of negations:
pos_rgx = paste0("\\b(", paste(pos, collapse="|"), ")\\b")
neg_rgx = paste0("\\b(", paste(neg, collapse="|"), ")\\b")
You can now check, for every sentence, whether it contains a positive or a negative word:
grepl(pos_rgx, sentences, ignore.case=TRUE)
grepl(neg_rgx, sentences, ignore.case=TRUE)
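As an illustration, here is how these two calls behave on the five sentences from the question. This is a sketch that assumes the word lists contain only the sample words quoted there; the names has_pos and has_neg are made up for the example.

```r
sentences <- c("You are not perfect!",
               "However, let us not forget what happened across the Atlantic.",
               "And I can't support you.",
               "No abnormal energy readings",
               "So with gratitude, the universe is abundant forever.")

# Sample dictionaries (assumption: stand-ins for the full word files)
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")

# One alternation per dictionary, wrapped in word boundaries
pos_rgx <- paste0("\\b(", paste(pos, collapse = "|"), ")\\b")
neg_rgx <- paste0("\\b(", paste(neg, collapse = "|"), ")\\b")

has_pos <- grepl(pos_rgx, sentences, ignore.case = TRUE)
has_neg <- grepl(neg_rgx, sentences, ignore.case = TRUE)

data.frame(has_pos, has_neg)
```

Note that grepl returns one logical value per sentence, not one per word, so it tells you whether a match exists but not how many there are.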
To add the negations, you can proceed in the same way:
pos_neg_rgx = paste0('\\b(no|not|couldn\'t|can\'t)\\s', pos_rgx)
grepl(pos_neg_rgx, sentences, ignore.case=TRUE)
note: '\\s' matches a single whitespace character between the negation and the positive word
note(2): if you quote your string with single quotes, you have to escape the quote in "can't" (as in the example). Otherwise you can quote the string with double quotes: "\\b(no|not|couldn't|can't)\\s"
If you would like to dig deeper into text mining, have a look at the tidytext package
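To make the negation step concrete, here is a sketch that applies the same idea to both dictionaries on the raw sentences from the question. The word lists are the quoted samples only, and the names negation_rgx, negated_pos and negated_neg are introduced purely for illustration.

```r
sentences <- c("You are not perfect!",
               "However, let us not forget what happened across the Atlantic.",
               "And I can't support you.",
               "No abnormal energy readings",
               "So with gratitude, the universe is abundant forever.")

# Sample dictionaries (assumption: stand-ins for the full word files)
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")
pos_rgx <- paste0("\\b(", paste(pos, collapse = "|"), ")\\b")
neg_rgx <- paste0("\\b(", paste(neg, collapse = "|"), ")\\b")

# A negation, one whitespace character, then a dictionary word
negation_rgx <- "\\b(no|not|couldn't|can't)\\s"
negated_pos <- grepl(paste0(negation_rgx, pos_rgx), sentences, ignore.case = TRUE)
negated_neg <- grepl(paste0(negation_rgx, neg_rgx), sentences, ignore.case = TRUE)

data.frame(negated_pos, negated_neg)
```

With ignore.case = TRUE the capitalised "No" in the fourth sentence is matched as well. A negated positive word can then be counted as negative, and a negated negative word as positive.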
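If the word-level vectors from the question are needed instead of one flag per sentence, the following base R sketch is one possible way to get them. It rests on several assumptions: the function name score_with_negation is hypothetical, the text is lower-cased (which the original code does not do) so that "No" matches "no", the negations are written without apostrophes because punctuation is stripped first, and only a negation directly before a dictionary word is merged with it.

```r
# Sample dictionaries (assumption: stand-ins for the full word files)
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")
negations <- c("no", "not", "couldnt", "cant")  # apostrophes already stripped

score_with_negation <- function(sentence, pos, neg, negations) {
  # Remove punctuation, control characters and digits, then lower-case
  sentence <- gsub("[[:punct:][:cntrl:][:digit:]]", "", sentence)
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  n <- length(words)

  is_pos <- words %in% pos
  is_neg <- words %in% neg

  # Word preceding each position ("" for the first word)
  prev <- c("", words)[seq_len(n)]
  # Dictionary words whose polarity is flipped by a preceding negation
  flipped <- (prev %in% negations) & (is_pos | is_neg)

  final_pos <- (is_pos & !flipped) | (is_neg & flipped)
  final_neg <- (is_neg & !flipped) | (is_pos & flipped)

  # Drop each negation that was merged into the following term
  keep <- !c(flipped[-1], FALSE)[seq_len(n)]
  terms <- ifelse(flipped, paste(prev, words), words)

  list(terms = terms[keep],
       pos.matches = final_pos[keep],
       neg.matches = final_neg[keep])
}

r4 <- score_with_negation("No abnormal energy readings", pos, neg, negations)
r4$terms
r4$pos.matches
```

For sentences[4] this yields three terms ("no abnormal", "energy", "readings") with the first one flagged as positive, which is the result the question asks for. Negations separated from the dictionary word by other words (e.g. "not very perfect") are not handled by this sketch.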
Upvotes: 1