Reputation: 47
I'm looking for specific n-grams in a corpus. Let's say I want to find 'asset management' and 'historical yield' in a collection of documents.
This is how I loaded the corpus:
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
                     readerControl = list(reader = readPDF))
I cleaned the corpus and did some basic calculations using document term matrices. Now I want to look for particular expressions and put them in a dataframe. This is what I use (thanks to phiver):
ngrams <- c('asset management', 'historical yield')
dtm_ngrams <- DocumentTermMatrix(my_corpus, control = list(dictionary = ngrams))
df_ngrams <- data.frame(Docs = dtm_ngrams$dimnames$Docs, as.matrix(dtm_ngrams), row.names = NULL)
This code runs, but the result is 0 for both n-grams. My guess is that the dictionary is not defined correctly, because R doesn't pick up the space between the words. So far I have tried putting '' between the words, or [:space:], and some other solutions.
Upvotes: 0
Views: 1532
Reputation: 23598
A document term matrix without any further manipulation contains only single words (and, by default, only words of 3 or more characters). If you want bigrams, you need to create a term matrix of bigrams (or of unigrams and bigrams).
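To see why the dictionary lookup in the question returns nothing, here is a minimal sketch that passes a two-word dictionary to a default DocumentTermMatrix. It uses the crude data set that ships with tm; the names dtm_default and total_hits are only illustrative. Because the default tokenizer emits single words, no token can equal "crude oil", so the total should come out as 0, which matches what the question reports.

```r
library(tm)

data("crude")  # 20 Reuters articles shipped with tm

# Default control: single-word tokenization, so a two-word dictionary
# entry can never be equal to any token.
dtm_default <- DocumentTermMatrix(
  crude,
  control = list(dictionary = c("crude oil", "west texas"))
)

# Total count over all documents and both dictionary terms.
total_hits <- sum(as.matrix(dtm_default))
print(total_hits)
```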
Based on your example, and using just tm and NLP (which is loaded as soon as you load tm), we can make a bigram tokenizer, or a multi-gram tokenizer; see the comment in the code.
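As a small illustration of what such a tokenizer produces, the sketch below applies NLP's ngrams() to a plain character vector of words instead of a document. The helper name ngram_tokenizer and its n argument are hypothetical, introduced here only for the example; with n = 1:2 it should return the single words as well as every pair of adjacent words.

```r
library(NLP)

# Hypothetical helper (not from the answer): builds all n-grams of the
# lengths given in `n` from a character vector of words and pastes each
# n-gram into a single space-separated string.
ngram_tokenizer <- function(word_vec, n = 2) {
  unlist(lapply(ngrams(word_vec, n), paste, collapse = " "), use.names = FALSE)
}

tokens <- c("west", "texas", "crude", "oil")

# Unigrams and bigrams together.
grams <- ngram_tokenizer(tokens, n = 1:2)
print(grams)
```

Only terms that appear in this token list can ever be matched by a dictionary, which is why the tokenizer has to emit the two-word strings.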
This uses the built-in crude data set again:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
# This tokenizer is built on NLP and creates bigrams.
# For multi-grams, pass a range instead of 2: 1:2 for uni- and bigrams,
# 2:3 for bi- and trigrams, 1:3 for uni-, bi- and trigrams,
# e.g. ngrams(words(x), 1:3).
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
my_words <- c("crude oil", "west texas")
dtm <- DocumentTermMatrix(crude,
                          control = list(tokenizer = bigram_tokenizer,
                                         dictionary = my_words))
inspect(dtm)
<<DocumentTermMatrix (documents: 20, terms: 2)>>
Non-/sparse entries: 11/29
Sparsity : 72%
Maximal term length: 10
Weighting : term frequency (tf)
Sample             :
     Terms
Docs  crude oil west texas
  127         2          1
  144         0          0
  191         2          0
  194         1          2
  211         0          0
  273         2          0
  349         1          0
  353         1          0
  543         1          1
  708         1          0
Once the dtm is built, you can put it into a data.frame again, as mentioned in your previous question.
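A minimal sketch of that last step, rebuilding the bigram dtm from the crude example and converting it with the same data.frame() pattern used in the question. Docs() is tm's accessor for the document names; check.names = FALSE is an addition here so that the column names keep their spaces ("crude oil" rather than "crude.oil").

```r
library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

# Bigram tokenizer built on NLP's ngrams() and words().
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

my_words <- c("crude oil", "west texas")
dtm <- DocumentTermMatrix(crude,
                          control = list(tokenizer = bigram_tokenizer,
                                         dictionary = my_words))

# One row per document, one column per dictionary n-gram.
df_ngrams <- data.frame(Docs = Docs(dtm), as.matrix(dtm),
                        row.names = NULL, check.names = FALSE)
head(df_ngrams)
```

For your own corpus, swap crude for my_corpus and my_words for your ngrams vector.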
Upvotes: 1