bgreen

Reputation: 87

How to obtain and save trigrams from text mining program TM - in text or csv format

I'm hoping to identify trigrams and phrases in a corpus using the tm package and save the output as a text or CSV file. I haven't found a way to do this in quanteda either (see: How to save n-gram output).

This reproducible code works: ///

library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
summary(crude)
# This tokenizer uses ngrams() from the NLP package and creates bigrams.
# For mixed n-gram sizes, pass a range instead of 2:
# 1:2 for uni- and bigrams, 2:3 for bi- and trigrams,
# 1:3 for uni-, bi- and trigrams, e.g. ngrams(words(x), 1:3).
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
dtm <- DocumentTermMatrix(crude, control=list(tokenizer = bigram_tokenizer))
inspect(dtm)

///
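For the saving step, here is a minimal sketch of one possible approach, assuming the same `crude` example data: sum each trigram's count over all documents and write the result with `write.csv()`. The names `trigram_tokenizer`, `trigrams`, and the output file `trigrams.csv` are made up for illustration. Note that `as.matrix()` creates a dense matrix, which may be too large for a big corpus.

```r
library(tm)

data("crude")
crude <- tm_map(as.VCorpus(crude), content_transformer(tolower))

# Hypothetical tokenizer name: returns every 3-word sequence as one string
trigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}

dtm <- DocumentTermMatrix(crude, control = list(tokenizer = trigram_tokenizer))

# Total count of each trigram across all documents, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
trigrams <- data.frame(trigram = names(freq),
                       count = as.integer(freq),
                       row.names = NULL,
                       stringsAsFactors = FALSE)

# "trigrams.csv" is a placeholder file name in the working directory
write.csv(trigrams, "trigrams.csv", row.names = FALSE)
```

To keep per-document counts instead of totals, `write.csv(as.matrix(dtm), ...)` on the full matrix should also work, at the cost of a much wider file.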

However, when I try it on my own data, I get these warnings:

    1: In TermDocumentMatrix.SimpleCorpus(x, control) :
      custom functions are ignored
    2: In TermDocumentMatrix.SimpleCorpus(x, control) :
      custom tokenizer is ignored

My code and output follow:

///

source <- DirSource("D:/Hanson/Newdata/Hanson2/") # input path for documents
TMCorpus <- Corpus(source, readerControl = list(reader = readPlain)) # load in documents
TMCorpus <- tm_map(TMCorpus, removePunctuation)
TMCorpus <- tm_map(TMCorpus, stripWhitespace)
TMCorpus <- tm_map(TMCorpus, tolower)

# Still named bigram_tokenizer, but n = 3 here, so it produces trigrams
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}

adtm <- DocumentTermMatrix(TMCorpus, control = list(tokenizer = bigram_tokenizer))
inspect(adtm)

///

summary(TMCorpus)

                     Length Class             Mode
1_15-6-23.txt        2      PlainTextDocument list
10_14-6-23.txt       2      PlainTextDocument list
100_24-11-22.txt     2      PlainTextDocument list
1000_1-12-16.txt     2      PlainTextDocument list
.....
512_14-11-19.txt     2      PlainTextDocument list
513_13-11-19 (2).txt 2      PlainTextDocument list
 [ reached getOption("max.print") -- omitted 369 rows ]
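The warnings name `TermDocumentMatrix.SimpleCorpus`, while the working example converts its data with `as.VCorpus()`. That suggests the custom tokenizer is only honored for a `VCorpus`. Below is a hedged sketch of the same pipeline built with `VCorpus()` and with `tolower` wrapped in `content_transformer()`, as in the working example. The temporary folder, the two sample files, and the name `trigram_tokenizer` are stand-ins, not the real data.

```r
library(tm)

# Hypothetical stand-in for the real document folder: two small text files
doc_dir <- file.path(tempdir(), "ngram_demo")
dir.create(doc_dir, showWarnings = FALSE)
writeLines("The quick brown fox jumps over the lazy dog.",
           file.path(doc_dir, "a.txt"))
writeLines("The quick brown fox sleeps under the old tree.",
           file.path(doc_dir, "b.txt"))

# VCorpus() rather than Corpus(), so the result is not a SimpleCorpus
corpus <- VCorpus(DirSource(doc_dir), readerControl = list(reader = readPlain))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))

# Hypothetical tokenizer name: returns every 3-word sequence as one string
trigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}

adtm <- DocumentTermMatrix(corpus, control = list(tokenizer = trigram_tokenizer))
inspect(adtm)
```

With the real data, `doc_dir` would be replaced by the actual input path. Whether this removes the warnings in a given setup may depend on the installed tm version.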
