I'm hoping to identify trigrams and phrases in a corpus using tm and save the output as a text or CSV file. I haven't found a way to do this in quanteda either: how do I save the n-gram output?
This reproducible code works:
///
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
summary(crude)
# This tokenizer is built on NLP and creates bigrams.
# For multi-grams, pass a range instead: 1:2 for uni- and bigrams,
# 2:3 for bi- and trigrams, 1:3 for uni-, bi-, and trigrams,
# e.g. ngrams(words(x), 1:3).
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
dtm <- DocumentTermMatrix(crude, control = list(tokenize = bigram_tokenizer))
inspect(dtm)
///
However, when I try it on my own data I get these warnings:
///
1: In TermDocumentMatrix.SimpleCorpus(x, control) :
  custom functions are ignored
2: In TermDocumentMatrix.SimpleCorpus(x, control) :
  custom tokenizer is ignored
///
My code and output follow:
///
source <- DirSource("D:/Hanson/Newdata/Hanson2/")  # input path for documents
TMCorpus <- Corpus(source, readerControl = list(reader = readPlain))  # load documents
TMCorpus <- tm_map(TMCorpus, removePunctuation)
TMCorpus <- tm_map(TMCorpus, stripWhitespace)
TMCorpus <- tm_map(TMCorpus, tolower)
# Note: despite being based on the bigram tokenizer above, this one creates trigrams (n = 3).
trigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
adtm <- DocumentTermMatrix(TMCorpus, control = list(tokenize = trigram_tokenizer))
inspect(adtm)
///
summary(TMCorpus)
///
                     Length Class             Mode
1_15-6-23.txt        2      PlainTextDocument list
10_14-6-23.txt       2      PlainTextDocument list
100_24-11-22.txt     2      PlainTextDocument list
1000_1-12-16.txt     2      PlainTextDocument list
.....
512_14-11-19.txt     2      PlainTextDocument list
513_13-11-19 (2).txt 2      PlainTextDocument list
[ reached getOption("max.print") -- omitted 369 rows ]
///
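Update: the warnings are raised by TermDocumentMatrix.SimpleCorpus, and in recent tm versions Corpus() on a DirSource returns a SimpleCorpus, which ignores custom tokenizers. I believe (untested beyond the crude example above) that building a VCorpus instead, as in the working block, restores custom-tokenizer support, after which the matrix can be written out; the output file names below are just placeholders:

///
library(tm)

source <- DirSource("D:/Hanson/Newdata/Hanson2/")
# VCorpus (unlike SimpleCorpus) honours custom tokenizers in the control list
TMCorpus <- VCorpus(source, readerControl = list(reader = readPlain))
TMCorpus <- tm_map(TMCorpus, removePunctuation)
TMCorpus <- tm_map(TMCorpus, stripWhitespace)
TMCorpus <- tm_map(TMCorpus, content_transformer(tolower))

trigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}

adtm <- DocumentTermMatrix(TMCorpus, control = list(tokenize = trigram_tokenizer))

# Save the full document-term matrix as CSV, or just the trigrams as text
write.csv(as.matrix(adtm), "trigram_dtm.csv")
writeLines(Terms(adtm), "trigrams.txt")
///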