Reputation: 1326
Apologies if this turns out to be a very specific problem that does not generalise to others' situations.
Background
I hope to do some sentiment analysis, starting from the basic binary matching of words from a lexicon, and then moving towards some more complex form of Sentiment Analysis, making use of grammatical rules, etc.
Problem
To do some binary matching - which will form the first phase of Sentiment Analysis - I am provided with two tables, one containing words, and the other containing Parts-Of-Speech for these words.
    V1     V2        V3          V4   V5
1    R     is fantastic    language <NA>
2 Java     is       far        from good
3 Data mining        is fascinating <NA>
   V1  V2  V3 V4   V5
1  NN VBZ  JJ NN <NA>
2 NNP VBZ  RB IN   JJ
3 NNP  NN VBZ JJ <NA>
I would like to carry out some basic Sentiment Analysis as follows: I want to apply a function that takes two arguments, a word (from the first data frame) and its corresponding POS tag (from the second), and uses the tag to decide which word list to consult when determining the positive/negative orientation of the word. For example, the word fantastic would be extracted along with the POS tag 'JJ', and so only the list of adjectives would be inspected for the presence or absence of this word.
Eventually, I would like to end up with a data frame that shows the result of matching:
  V1 V2 V3 V4   V5
1  0  0  1  0 <NA>
2  0  0 -1  0    1
3  0  0  0  1 <NA>
I tried formulating my own code, but kept getting an error, after which I felt this was not going to work.
# test sentences
sentences <- as.list(c("R is fantastic language", "Java is far from good", "Data mining is fascinating"))
# using the openNLP package
require(openNLP)
# perform tagging
taggedSentences <- tagPOS(sentences)
# split to words
individualWords <- unname(sapply(taggedSentences, function(x) {strsplit(x, split = " ")}))
# strip tags
individualWordsClean <- unname(sapply(individualWords, function(x) {gsub("/.+", "", x)}))
# strip words
individualTags <- unname(sapply(individualWords, function(x) {gsub(".+/", "", x)}))
# create a data frame for words; courtesy @trinker
numberRow <- length(individualWords)
numberCol <- unname(sapply(individualWords, length))
df1 <- as.data.frame(matrix(nrow = numberRow, ncol = max(numberCol)))
for (i in 1:numberRow) {
  df1[i, 1:numberCol[i]] <- individualWordsClean[[i]]
}
# create a data frame for tags; courtesy @trinker
numberRow <- length(individualWords)
numberCol <- unname(sapply(individualTags, length))
df2 <- as.data.frame(matrix(nrow = numberRow, ncol = max(numberCol)))
for (i in 1:numberRow) {
  df2[i, 1:numberCol[i]] <- individualTags[[i]]
}
# Create negative/positive word lists
posAdj <- c("fantastic", "fascinating", "good")
negAdj <- c("bad", "poor")
posNoun <- "R"
negNoun <- "Java"
# Function to match words and assign sentiment score
checkLexicon <- function(word, tag) {
  if (grep("JJ|JJR|JJS", tag)) {
    ifelse(word %in% posAdj, +1, ifelse(word %in% negAdj, -1, 0))
  } else if (grep("NN|NNP|NNPS|NNS", tag)) {
    ifelse(word %in% posNoun, +1, ifelse(word %in% negNoun, -1, 0))
  } else if (grep("VBZ", tag)) {
    ifelse(word %in% "is", "ok", "none")
  } else if (grep("RB", tag)) {
    ifelse(word %in% "not", -1, 0)
  } else if (grep("IN", tag)) {
    ifelse(word %in% "far", -1, 0)
  }
}
# Method to output a single value when used in conjunction with apply
justShow <- function(x) {
  x
}
# Main method that intends to extract word/POS tag pair, and determine sentiment score
mapply(FUN = checkLexicon, word = apply(df1, 2, justShow), tag = apply(df2, 2, justShow))
Unfortunately, I have had no success with this method, and the error received is
Error in if (grep("JJ|JJR|JJS", tag)) { : argument is of length zero
I am relatively new to R, but it seems that I am unable to use the apply function here, as it returns no argument to the mapply function. Also, I am not sure if mapply will actually produce another data frame.
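For reference, here is a minimal base R check of what these two calls hand back. It is only a sketch on a tiny stand-in for df1 and df2 (the names w, tg, wm, tm and pairs are made up for illustration). On input like this, apply returns a character matrix and mapply returns a plain vector in column-by-column order, which can be reshaped afterwards.

```r
# Tiny stand-ins for df1/df2 (hypothetical values, base R only).
w  <- data.frame(V1 = c("R", "Java"), V2 = c("is", "is"), stringsAsFactors = FALSE)
tg <- data.frame(V1 = c("NN", "NNP"), V2 = c("VBZ", "VBZ"), stringsAsFactors = FALSE)

# apply() converts the data frame to a matrix and returns a character matrix.
wm <- apply(w, 2, function(x) x)
tm <- apply(tg, 2, function(x) x)

# mapply() walks both matrices element by element, column by column.
pairs <- mapply(function(word, tag) paste(word, tag, sep = "/"), wm, tm,
                USE.NAMES = FALSE)

is.matrix(wm)         # TRUE
is.data.frame(pairs)  # FALSE: a plain character vector

# Reshape the flat result back to the original row/column layout.
reshaped <- matrix(pairs, nrow = nrow(w))
```

So the result of mapply needs an explicit matrix() or as.data.frame() step before it looks like the input tables again.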
Please do criticise/advise. Thanks
PS. Link to TRinker's notes on R for those interested.
Upvotes: 1
Views: 443
Reputation: 1326
The mistake was using grep where grepl was needed: if() requires a single TRUE/FALSE value, which grepl returns and grep does not. This was corrected after Joran pointed it out.
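To make the difference concrete, here is a small base R check (the variable names idx, flag and naFlag are only for illustration). It shows why the original condition had nothing to test when the tag did not match.

```r
# Compare what grep() and grepl() return for a tag that does not match.
tag <- "NN"

idx  <- grep("JJ|JJR|JJS", tag)   # indices of matching elements: integer(0) here
flag <- grepl("JJ|JJR|JJS", tag)  # one logical per element: FALSE here

length(idx)  # 0, so if (idx) fails with "argument is of length zero"
flag         # FALSE, safe to use inside if ()

# grepl() on a missing tag gives FALSE rather than NA.
naFlag <- grepl("JJ", NA)
```

Because grepl returns FALSE for NA, cells with no word fall through every branch of the function, and it returns NULL for them.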
The working function is as follows.
> df1
    V1     V2        V3          V4   V5
1    R     is fantastic    language <NA>
2 Java     is       far        from good
3 Data mining        is fascinating <NA>
> df2
   V1  V2  V3 V4   V5
1  NN VBZ  JJ NN <NA>
2 NNP VBZ  RB IN   JJ
3 NNP  NN VBZ JJ <NA>
# Function to match words and assign sentiment score
checkLexicon <- function(word, tag) {
  if (grepl("JJ|JJR|JJS", tag)) {
    ifelse(word %in% posAdj, +1, ifelse(word %in% negAdj, -1, 0))
  } else if (grepl("NN|NNP|NNPS|NNS", tag)) {
    ifelse(word %in% posNoun, +1, ifelse(word %in% negNoun, -1, 0))
  } else if (grepl("VBZ", tag)) {
    ifelse(word %in% "is", "ok", "none")
  } else if (grepl("RB", tag)) {
    ifelse(word %in% "not", -1, 0)
  } else if (grepl("IN", tag)) {
    ifelse(word %in% "far", -1, 0)
  }
}
# Method to output a single value when used in conjunction with apply
justShow <- function(x) {
  x
}
# Main method that extracts each word/POS tag pair and determines its sentiment score
myObject <- mapply(FUN = checkLexicon, word = apply(df1, 2, justShow), tag = apply(df2, 2, justShow))
# Shaping the final data frame (one row per sentence)
scoredDF <- as.data.frame(matrix(myObject, nrow = nrow(df1)))
  V1 V2 V3 V4   V5
1  1 ok  1  0 NULL
2 -1 ok  0  0    1
3  0  0 ok  1 NULL
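The result above mixes numbers, "ok" and NULL. If a purely numeric table is wanted, closer to the layout asked for in the question, one possible variant is sketched below. It is an assumption-laden sketch, not the method above: scoreWord and scoredNumeric are hypothetical names, verbs are simply scored 0 instead of "ok", and missing cells become NA. The inputs are rebuilt by hand so the block runs without openNLP.

```r
# Rebuild the example inputs so the sketch is self-contained.
df1 <- data.frame(
  V1 = c("R", "Java", "Data"), V2 = c("is", "is", "mining"),
  V3 = c("fantastic", "far", "is"), V4 = c("language", "from", "fascinating"),
  V5 = c(NA, "good", NA), stringsAsFactors = FALSE
)
df2 <- data.frame(
  V1 = c("NN", "NNP", "NNP"), V2 = c("VBZ", "VBZ", "NN"),
  V3 = c("JJ", "RB", "VBZ"), V4 = c("NN", "IN", "JJ"),
  V5 = c(NA, "JJ", NA), stringsAsFactors = FALSE
)

posAdj <- c("fantastic", "fascinating", "good")
negAdj <- c("bad", "poor")
posNoun <- "R"
negNoun <- "Java"

# scoreWord (hypothetical) always returns exactly one numeric value:
# NA for a missing cell, 0 for any tag without a matching rule.
scoreWord <- function(word, tag) {
  if (is.na(word) || is.na(tag)) return(NA_real_)
  if (grepl("^JJ", tag)) {
    if (word %in% posAdj) 1 else if (word %in% negAdj) -1 else 0
  } else if (grepl("^NN", tag)) {
    if (word %in% posNoun) 1 else if (word %in% negNoun) -1 else 0
  } else if (grepl("^RB", tag)) {
    if (word == "not") -1 else 0
  } else if (grepl("^IN", tag)) {
    if (word == "far") -1 else 0
  } else {
    0  # assumption: verbs and all other tags carry no score
  }
}

words <- as.matrix(df1)
tags  <- as.matrix(df2)

# mapply visits cells column by column; matrix() restores the layout.
scores <- mapply(scoreWord, words, tags, USE.NAMES = FALSE)
scoredNumeric <- as.data.frame(
  matrix(scores, nrow = nrow(df1), dimnames = dimnames(words))
)
```

Because every branch returns a single number, mapply simplifies to a numeric vector, so the final data frame has numeric columns and NA where a sentence has no fifth word.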
Upvotes: 1