ldlpdx

Reputation: 61

R function for pattern matching

I am doing a text mining project that will analyze some speeches from the three remaining presidential candidates. I have completed POS tagging with OpenNLP and created a two-column data frame with the results, then added a variable called pair. Here is a sample from the Clinton data frame:

           V1   V2  pair
1          c(  NN  FALSE
2      "thank VBP  FALSE
3         you PRP  FALSE
4          so  RB  FALSE
5        much  RB  FALSE
6           .   .  FALSE
7          it PRP  FALSE
8          is VBZ  FALSE
9   wonderful  JJ  FALSE
10         to  TO  FALSE
11         be  VB  FALSE
12       here  RB  FALSE
13        and  CC  FALSE
14        see  VB  FALSE
15         so  RB  FALSE
16       many  JJ  FALSE
17    friends NNS  FALSE
18          .   .  FALSE
19        ive  JJ  FALSE
20     spoken VBN  FALSE 

What I'm now trying to do is write a function that will iterate through the V2 POS column and evaluate it for specific pattern pairs. (These come from Turney's PMI article.) I'm not yet very knowledgeable when it comes to writing functions, so I'm certain I've done it wrong, but here is what I've got so far.

pairs <- function(x){

  JJ <- "JJ"      #adjectives
  N <- "N[A-Z]"   #any noun form
  R <- "R[A-Z]"   #any adverb form
  V <- "V[A-Z]"   #any verb form

  for(i in 1:(length)(x) {
      if(x == J && x+1 == N) {    #i.e., if the first word = J and the next = N
        pair[i] <- "JJ|NN"     #insert this into the 'pair' variable
      } else if (x == R && x+1 == J && x+2 != N) {
        pair[i] <- "RB|JJ"
      } else if  (x == J && x+1 == J && x+2 != N) {
        pair[i] <- "JJ|JJ"
      } else if (x == N && x+1 == J && x+2 != N) {
        pair[i] <- "NN|JJ"
      } else if (x == R && x+1 == V) {
        pair[i] <- "RB|VB"
         } else {
         pair[i] <- "FALSE"
         }
  }
}

# Run the function
cl.df.pairs <- pairs(cl.df$V2)

There are a number of (truly embarrassing) issues. First, when I try to run the function code, I get two Error: unexpected '}' in " }" errors at the end. I can't figure out why, because they match opening "{". I'm assuming it's because R is expecting something else to be there.

Also, and more importantly, this function won't exactly get me what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.

Then I need to figure out how to evaluate the semantic orientation of each word combo by comparing the phrases to the pos/neg lexical data sets that I have, but that's a whole other issue. I have the formula from the article, which I'm hoping will point me in the right direction.

I have looked all over and can't find a comparable function in any of the NLP packages, such as OpenNLP, RTextTools, etc. I HAVE looked at other SO questions/answers, like this one and this one, but they haven't worked for me when I've tried to adapt them. I'm fairly certain I'm missing something obvious here, so would appreciate any advice.

EDIT:

Here are the first 20 lines of the Sanders data frame.

head(sa.POS.df, 20)
           V1   V2
1         the   DT
2    american   JJ
3      people  NNS
4         are  VBP
5    catching  VBG
6          on   RB
7           .    .
8        they  PRP
9  understand  VBP
10       that   IN
11  something   NN
12         is  VBZ
13 profoundly   RB
14      wrong   JJ
15       when  WRB
16          ,    ,
17         in   IN
18        our PRP$
19    country   NN
20      today   NN

And I've written the following function:

pairs <- function(x, y) {
  require(gsubfn)
  J <- "JJ"      #adjectives
  N <- "N[A-Z]"   #any noun form
  R <- "R[A-Z]"   #any adverb form
  V <- "V[A-Z]"   #any verb form

  for(i in 1:(length(x))) {
    ngram <- c(x[[i]], x[[i+1]]) 
# the ngram consists of the word on line `i` and the word below line `i`
  }
  strapply(y[i], "(J)\n(N)", FUN = paste(ngram, sep = " "), simplify = TRUE)

  ngrams.df = data.frame(ngrams=ngram)
  return(ngrams.df)
}

So, what is SUPPOSED to happen is that when strapply matches the pattern (in this case, an adjective followed by a noun), it should paste the ngram, and all of the resulting ngrams should populate ngrams.df.
Instead, when I enter the following function call, I get an error:

> sa.JN <- pairs(x=sa.POS.df$V1, y=sa.POS.df$V2)
Error in x[[i + 1]] : subscript out of bounds  

I'm only just learning the intricacies of regular expressions, so I'm not quite sure how to get my function to pull the actual adjective and noun. Based on the data shown here, it should pull "american" and "people" and paste them into the data frame.

Upvotes: 0

Views: 884

Answers (2)

Gregor Thomas

Reputation: 145765

Okay, here we go. Using this data (shared nicely with dput()):

df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L, 
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",", 
".", "american", "are", "catching", "country", "in", "is", "on", 
"our", "people", "profoundly", "something", "that", "the", "they", 
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L, 
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L, 
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN", 
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1", 
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
"16", "17", "18", "19", "20"))

I'll use the stringr package because of its consistent syntax, so I don't have to look up the argument order for grep. We'll first detect the adjectives, then the nouns, and figure out where they line up (offsetting by 1). Then we paste together the words that correspond to the matches.

library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")

pairs = which(c(FALSE, adj) & c(noun, FALSE))

ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
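
To make the offset step concrete, here is a minimal base-R sketch on a made-up four-word example. The `words` and `tags` values are illustrative only (they are not from the speeches), and grepl() stands in for str_detect() so that nothing beyond base R is assumed.

```r
# Toy data (hypothetical, for illustration only)
words <- c("a", "big", "dog", "barks")
tags  <- c("DT", "JJ", "NN", "VBZ")

adj  <- grepl("JJ", tags)   # FALSE  TRUE FALSE FALSE
noun <- grepl("NN", tags)   # FALSE FALSE  TRUE FALSE

# Prepending FALSE shifts adj down one slot, so position k holds adj[k - 1].
# Appending FALSE to noun only pads it to the same length.
shifted_adj <- c(FALSE, adj)    # FALSE FALSE  TRUE FALSE FALSE
padded_noun <- c(noun, FALSE)   # FALSE FALSE  TRUE FALSE FALSE

# TRUE at k means: tag k - 1 is an adjective and tag k is a noun.
hits <- which(shifted_adj & padded_noun)         # 3
bigrams <- paste(words[hits - 1], words[hits])   # "big dog"
```

Each index in `hits` points at the second word of a pair, which is why the first word is taken from `hits - 1`.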

Now we can put it in a function. I left the patterns as arguments (with adjective, noun as the defaults) for flexibility.

bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
    pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
                      c(str_detect(type, patt2), FALSE))
    return(paste(word[pairs - 1], word[pairs]))
}

Demonstrating use on the original data

with(df, bigram(word = V1, type = V2))
# [1] "american people"

Let's cook up some data with more than one match to make sure it works:

df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad",  "bank"),
                 t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
#          w   t
# 1 american  JJ
# 2   people NNS
# 3     hate VBP
# 4        a  DT
# 5      big  JJ
# 6      bad  JJ
# 7     bank  NN

with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"

And back to the original to test out a different pattern:

with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are"   "something is"
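
The question also wants the matched pattern reported next to each pair, and some of its patterns exclude a noun in the third position. The same shifting idea can be stretched to cover that. What follows is only a sketch under my own assumptions: `turney_pairs`, its column names, and the anchored patterns (`^JJ`, `^NN`, `^RB`, `^VB`) are hypothetical choices, not taken from the article or from any package, and base-R grepl() is used instead of stringr.

```r
# Sketch: label each bigram with the first pattern it matches.
# `turney_pairs` and its output columns are hypothetical names.
turney_pairs <- function(word, type) {
  word <- as.character(word)
  type <- as.character(type)
  n <- length(type)
  if (n < 2) {
    return(data.frame(phrase = character(0), pattern = character(0),
                      stringsAsFactors = FALSE))
  }

  first  <- type[-n]              # tag at position i
  second <- type[-1]              # tag at position i + 1
  third  <- c(type[-(1:2)], "")   # tag at position i + 2 ("" past the end)

  # Anchored so that, e.g., "PRP" or "WRB" is not read as an adverb
  is_j <- function(x) grepl("^JJ", x)
  is_n <- function(x) grepl("^NN", x)
  is_r <- function(x) grepl("^RB", x)
  is_v <- function(x) grepl("^VB", x)
  no_noun_next <- !is_n(third)

  # One logical vector per pattern, in the order used in the question
  rules <- list(
    "JJ|NN" = is_j(first) & is_n(second),
    "RB|JJ" = is_r(first) & is_j(second) & no_noun_next,
    "JJ|JJ" = is_j(first) & is_j(second) & no_noun_next,
    "NN|JJ" = is_n(first) & is_j(second) & no_noun_next,
    "RB|VB" = is_r(first) & is_v(second)
  )

  pattern <- rep(NA_character_, n - 1)
  for (label in names(rules)) {
    # Only fill positions that no earlier rule has claimed
    pattern[is.na(pattern) & rules[[label]]] <- label
  }

  keep <- which(!is.na(pattern))
  data.frame(phrase  = paste(word[keep], word[keep + 1]),
             pattern = pattern[keep],
             stringsAsFactors = FALSE)
}

# Made-up input in the same shape as the data frames in the question
words <- c("the", "american", "people", "are", "profoundly", "wrong", ".")
tags  <- c("DT", "JJ", "NNS", "VBP", "RB", "JJ", ".")
res <- turney_pairs(words, tags)
#             phrase pattern
# 1  american people   JJ|NN
# 2 profoundly wrong   RB|JJ
```

Returning a data frame rather than a character vector keeps each phrase next to its pattern label, which may help with the later lexicon comparison.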

Upvotes: 1

Erin

Reputation: 386

I think the following is the code you wrote, but without throwing errors:

pairs <- function(x) {

  # Patterns are anchored with ^ so that, e.g., "PRP" is not read as an adverb
  J <- "^JJ"       #adjectives
  N <- "^N[A-Z]"   #any noun form
  R <- "^R[A-Z]"   #any adverb form
  V <- "^V[A-Z]"   #any verb form

  pos <- as.character(x[, 2])     # POS tags as strings (the column may be a factor)
  pair <- rep("FALSE", nrow(x))   # one entry per row; length(x) would count columns
  for(i in seq_len(max(nrow(x) - 2, 0))) {
    this.pos <- pos[i]
    next.pos <- pos[i+1]
    next.next.pos <- pos[i+2]
    # grepl() tests the regex; == would only compare against the literal pattern text
    if(grepl(J, this.pos) && grepl(N, next.pos)) {    #i.e., if the first word = J and the next = N
      pair[i] <- "JJ|NN"     #insert this into the 'pair' variable
    } else if (grepl(R, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "RB|JJ"
    } else if (grepl(J, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "JJ|JJ"
    } else if (grepl(N, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "NN|JJ"
    } else if (grepl(R, this.pos) && grepl(V, next.pos)) {
      pair[i] <- "RB|VB"
    } else {
      pair[i] <- "FALSE"
    }
  }

  ## then deal with the last two elements, for which you can't check what's up next

  return(pair)
}
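
One possible way to deal with those last two elements is to pad the tag vector, so that the look-ahead is defined for every row. This is a rough sketch only; `lookahead` and its column names are hypothetical, and using "" as the filler is an assumption that relies on "" never matching a POS pattern.

```r
# Hypothetical helper: the tags at positions i, i + 1 and i + 2 side by side.
# Positions past the end are filled with "" so a comparison never meets NA
# or an out-of-bounds subscript.
lookahead <- function(tags) {
  tags <- as.character(tags)
  n <- length(tags)
  padded <- c(tags, "", "")
  data.frame(this = tags,
             nxt  = padded[seq_len(n) + 1],
             nxt2 = padded[seq_len(n) + 2],
             stringsAsFactors = FALSE)
}

la <- lookahead(c("RB", "JJ", "NN"))
#   this nxt nxt2
# 1   RB  JJ   NN
# 2   JJ  NN
# 3   NN
```

A loop over every row of `la` can then test all three columns, because the final rows simply see "" where there is no following tag.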

I'm not sure what you mean by this, though:

Also, and more importantly, this function won't exactly get me what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.

Upvotes: 1
