ALW94

Reputation: 23

How to match tokens in document term matrix to a separate data frame (of POS codes)

Basically I have my bag of words:

library(tm)

source <- VectorSource(text)
corpus <- Corpus(source)
corpus <- tm_map(corpus, content_transformer(tolower))
dtm <- DocumentTermMatrix(corpus)

etc etc.

I also have a data frame consisting of just two columns, which I pulled from a SQLite DB. Column 1 is a list of hundreds of words, and column 2 is each word's corresponding part-of-speech (POS) code.

I am trying to match every token in my dtm to the identical term in column 1 of the data frame, so that each token can then be matched to its corresponding POS code. Essentially, the data frame is like a dictionary, and I want to match each token in my dtm to its definition.

I tried a bunch of grep functions to do this, but to no avail. Does anyone have thoughts on the best way to approach this?

Thanks!

Upvotes: 1

Views: 932

Answers (1)

andrea

Reputation: 117

Try the lookup function in the qdap package.

library(qdap)

#create lookup table
words <- c("dog","cat","a", "the","run")
pos <- c("noun","noun","article","article","verb")
random <- c(3,1,2,5,4)
df <- data.frame(words, random, pos)

#create a stand-in for the doc-term matrix (terms and their frequencies)
terms<- c("human", "help","dog","cat","frog", "hello","a","party","run","cheers")
freq <- c(1,2,0,2,3,0,1,4,1,0)
dtm <- data.frame(terms, freq)

#append matches as a new column (terms not in the lookup table get NA)
dtm$pos <- lookup(dtm$terms, data.frame(df$words,df$pos), missing=NA)
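If you would rather not add a package, base R's match() should give the same kind of result. Here is a minimal sketch, assuming the lookup table has one row per word; the names pos_lookup, terms and tagged are made up for illustration. With a real DocumentTermMatrix from tm, the vocabulary would come from Terms(dtm) rather than a hand-typed vector.

```r
# Hypothetical lookup table, standing in for the two-column SQLite result
pos_lookup <- data.frame(
  word = c("dog", "cat", "a", "the", "run"),
  pos  = c("noun", "noun", "article", "article", "verb"),
  stringsAsFactors = FALSE
)

# Stand-in for the DTM vocabulary; with tm this would be Terms(dtm)
terms <- c("human", "help", "dog", "cat", "frog",
           "hello", "a", "party", "run", "cheers")

# match() gives the row of each term in the lookup table, or NA if absent
idx <- match(terms, pos_lookup$word)

# Indexing with NA yields NA, so unmatched terms get an NA POS code
tagged <- data.frame(term = terms, pos = pos_lookup$pos[idx],
                     stringsAsFactors = FALSE)
print(tagged)
```

Note that this is exact matching on the whole token, which is probably why the grep attempts were awkward: grep does pattern matching, so "a" would also hit "cat" and "party".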

Upvotes: 3
