Reputation: 6874
I am trying to extract (and eventually categorise) sentences from medical reports that contain negatives. An example is something like:
samples<-c('There is no evidence of a lump','Neither a contusion nor a scar was seen','No inflammation was evident','We found generalised badness here')
I am trying to use the sentimentr package, as it seems to be able to detect negators. Is there a way of using just the negator detection so that sentences containing negatives are extracted (preferably into a new dataframe for further work)?
Using polarity from qdap just gives a summary statistic, and it also takes amplifiers and deamplifiers into account, which I don't want to include, e.g.:
polarity(samples,negators = qdapDictionaries::negation.words)
all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
1 all 4 24 0.213 0.254 0.842
I tried the sentimentr package as follows:
extract_sentiment_terms(MyColonData$Endo_ResultText,polarity_dt = lexicon::hash_sentiment_jockers, hyphen = "")
and this gives me neutral, negative and positive words:
element_id sentence_id negative positive
1: 1 1
2: 2 1 scar
3: 3 1 inflammation evident
4: 4 1 badness found
but I am really looking for sentences that contain negators only without interpretation of the sentiment so that the output is:
element_id sentence_id negative positive
1: 1 1 There is no evidence of a lump
2: 2 1 Neither a contusion nor a scar was seen
3: 3 1 No inflammation was evident
4: 4 1 We found generalised badness here
Upvotes: 2
Views: 362
Reputation: 6325
I think you want to classify the text as positive or negative based only on the presence of a negator, so extracting the negators from the lexicon package should help.
samples<-c('There is no evidence of a lump','Neither a contusion nor a scar was seen','No inflammation was evident','We found generalised badness here')
polarity <- data.frame(text = samples, pol = NA)
polarity$pol <- ifelse(grepl(paste(lexicon::hash_valence_shifters[y==1]$x,collapse = '|'), tolower(samples)),'Negative','Positive')
polarity
text pol
1 There is no evidence of a lump Negative
2 Neither a contusion nor a scar was seen Negative
3 No inflammation was evident Negative
4 We found generalised badness here Positive
Formatted output:
reshape2::dcast(polarity,text~pol)
text Negative Positive
1 Neither a contusion nor a scar was seen Negative <NA>
2 No inflammation was evident Negative <NA>
3 There is no evidence of a lump Negative <NA>
4 We found generalised badness here <NA> Positive
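One caveat with the pattern above: it matches substrings, so a word such as "Normal" would be flagged because it contains "no". A minimal sketch of a whole-word variant follows. It assumes a short, hand-written negator list (hypothetical, not the full set in lexicon::hash_valence_shifters) and adds a fifth sample sentence that is not in the question, purely to show the difference. The matching sentences are collected in a new data frame, as the question asks.

```r
# Hypothetical, hand-written negator list (an assumption for illustration,
# not the complete set from lexicon::hash_valence_shifters)
negators <- c("no", "not", "never", "neither", "nor", "none", "without")

samples <- c('There is no evidence of a lump',
             'Neither a contusion nor a scar was seen',
             'No inflammation was evident',
             'We found generalised badness here',
             'Normal mucosa throughout')  # extra sentence, not in the question

# \\b anchors each negator at word boundaries, so "no" does not match "Normal"
pattern <- paste0("\\b(", paste(negators, collapse = "|"), ")\\b")
has_negator <- grepl(pattern, samples, ignore.case = TRUE, perl = TRUE)

# Keep only the sentences that contain a negator, in a new data frame
negated <- data.frame(element_id = which(has_negator),
                      text = samples[has_negator],
                      stringsAsFactors = FALSE)
negated
```

The element_id column keeps the position of each sentence in the original vector, so the extracted rows can be joined back to the source reports later.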
Upvotes: 3
Reputation: 2206
If I understand you correctly, you want to extract whole sentences if one of their words matches either a positive or a negative entry in lexicon::hash_sentiment_jockers. In that case you can use the code below (it could be tuned by using data.table in the intermediate steps if needed). I hope this is what you are looking for.
library(lexicon)
library(data.table)
library(stringi)
#check the content of the lexicon
lex <- copy(lexicon::hash_sentiment_jockers)
# x y
# 1: abandon -0.75
# 2: abandoned -0.50
# 3: abandoner -0.25
# 4: abandonment -0.25
# 5: abandons -1.00
# ---
# 10735: zealous 0.40
# 10736: zenith 0.40
# 10737: zest 0.50
# 10738: zombie -0.25
# 10739: zombies -0.25
#only consider binary positive or negative
pos <- lex[y > 0]
neg <- lex[y < 0]
samples <-c('There is no evidence of a lump'
,'Neither a contusion nor a scar was seen'
,'No inflammation was evident'
,'We found generalised badness here')
#get ids of the samples that include positive/negative terms
samples_pos <- which(stri_detect_regex(samples, paste(pos[,x], collapse = "|")))
samples_neg <- which(stri_detect_regex(samples, paste(neg[,x], collapse = "|")))
#set up data.frames with all positive/negative samples and their ids
df_pos <- data.frame(sentence_id = samples_pos, positive = samples[samples_pos])
df_neg <- data.frame(sentence_id = samples_neg, negative = samples[samples_neg])
#combine the sets
rbindlist(list(df_pos, df_neg), use.names = TRUE, fill = TRUE)
# sentence_id positive negative
# 1: 3 No inflammation was evident NA
# 2: 4 We found generalised badness here NA
# 3: 2 NA Neither a contusion nor a scar was seen
# 4: 3 NA No inflammation was evident
# 5: 4 NA We found generalised badness here
#the first sentence is missing, since none of its words is included in
#the lexicon; you might use stemming, etc. to increase coverage
any(grepl("evidence", lexicon::hash_sentiment_jockers[,x]))
#[1] FALSE
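As an alternative layout, the same idea can be sketched with one row per sentence and a logical flag per polarity, so that sentences with no lexicon hit still appear in the result. The example below is only an illustration under stated assumptions: the four-word lexicon and its scores are made up (not taken from lexicon::hash_sentiment_jockers), the helper whole_word() is hypothetical, and the fifth sample sentence is not in the question. It also matches whole words only, so "scar" does not fire on "scared".

```r
# Tiny illustrative lexicon (hypothetical terms and scores, an assumption;
# not the contents of lexicon::hash_sentiment_jockers)
lex <- data.frame(x = c("scar", "inflammation", "badness", "evident"),
                  y = c(-0.5, -0.5, -0.75, 0.5),
                  stringsAsFactors = FALSE)

samples <- c('There is no evidence of a lump',
             'Neither a contusion nor a scar was seen',
             'No inflammation was evident',
             'We found generalised badness here',
             'The patient was scared')  # extra sentence, not in the question

# Hypothetical helper: build a regex that matches any term as a whole word
whole_word <- function(terms) {
  paste0("\\b(", paste(terms, collapse = "|"), ")\\b")
}

neg_hit <- grepl(whole_word(lex$x[lex$y < 0]), samples,
                 ignore.case = TRUE, perl = TRUE)
pos_hit <- grepl(whole_word(lex$x[lex$y > 0]), samples,
                 ignore.case = TRUE, perl = TRUE)

# One row per sentence; sentences without any hit are kept with both flags FALSE
result <- data.frame(sentence_id = seq_along(samples),
                     text = samples,
                     negative = neg_hit,
                     positive = pos_hit,
                     stringsAsFactors = FALSE)
result
```

Keeping every sentence in the result makes it easy to see which reports the lexicon does not cover at all, which is where stemming or a larger word list would help.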
Upvotes: 2