Reputation: 155
I am trying to create a term-document matrix in R from a corpus of files. When I run the code, I get the following error, followed by two warnings:
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
'i, j' invalid
Calls: DocumentTermMatrix ... TermDocumentMatrix.VCorpus -> simple_triplet_matrix -> .Call
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
scheduled core 1 encountered error in user code, all values of the job will be affected
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
NAs introduced by coercion
My code is given below:
library(tm)
library(RWeka)
library(tmcn.word2vec)
#Reading data
data <- read.csv("Train.csv", header=T)
#text <- data$EventDescription
#Pre-processing
corpus <- Corpus(VectorSource(data$EventDescription))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
#dataframe <- data.frame(text=unlist(sapply(corpus,'[',"content")))
#Reading dictionary file
dict <- scan("dictionary.txt", what='character',sep='\n')
#Bigram Tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 4))
tdm_doc <- DocumentTermMatrix(corpus,control=list(stopwords = dict, tokenize=BigramTokenizer))
tdm_dic <- DocumentTermMatrix(corpus,control=list(tokenize=BigramTokenizer, dictionary=dict))
As suggested in other answers on SO, I have tried installing the SnowballC package and the other ideas listed there, but I still get the same error. Can anyone help me with this? Thanks in advance.
Upvotes: 10
Views: 14108
Reputation: 428
I had the same problem when building my DocumentTermMatrix, and I solved it by removing the following command:
corpus <- tm_map(corpus, PlainTextDocument)
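For reference, here is a minimal sketch of the cleaning pipeline with that line removed. It also wraps tolower in content_transformer(), which, as far as I know, tm 0.6 and later expect for plain character functions so that the documents keep their class. The texts vector is a made-up stand-in for data$EventDescription, and I use VCorpus and the default tokenizer here to keep the example small, so treat it as an illustration rather than a drop-in replacement for the question's code.

```r
library(tm)

# Hypothetical sample text standing in for data$EventDescription
texts <- c("Server  restarted, after a POWER failure.",
           "Power failure; server restarted twice!")

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
# content_transformer() applies a plain character function (here tolower)
# to each document's content and returns a document of the same class
corpus <- tm_map(corpus, content_transformer(tolower))
# Note: no tm_map(corpus, PlainTextDocument) step

dtm <- DocumentTermMatrix(corpus)
print(dim(dtm))
```

If this runs cleanly on the sample text, the same sequence can be tried on the real column before adding the n-gram tokenizer back in.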
Upvotes: 19
Reputation: 5966
I had a similar error when cleaning a corpus. Some of the tm_map functions do not return a corpus, so I fixed the problem by adding the following right after the offending line of code:
corpus <- Corpus(VectorSource(corpus))
For me, the problem arose after stem completion. I would suggest trying to build a TDM after each tm_map call; that will tell you which cleaning step is causing the problem.
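The step-by-step check can be sketched roughly like this. The texts vector and the steps list are made up for illustration, so swap in your own data and transformations. The loop applies one cleaning step at a time and tries to build a DocumentTermMatrix after each one, recording whether that attempt succeeded.

```r
library(tm)

# Hypothetical sample text standing in for the real corpus
texts <- c("Server restarted, after a POWER failure.",
           "Power failure; server restarted twice!")

# Cleaning steps in the order they are applied; the names are only labels
steps <- list(
  stripWhitespace   = stripWhitespace,
  removePunctuation = removePunctuation,
  tolower           = content_transformer(tolower)
)

corpus <- VCorpus(VectorSource(texts))
results <- character(0)
for (name in names(steps)) {
  corpus <- tm_map(corpus, steps[[name]])
  # Try to build the matrix after this step and record whether it worked
  ok <- tryCatch({
    DocumentTermMatrix(corpus)
    TRUE
  }, error = function(e) FALSE)
  results[name] <- if (ok) "ok" else "FAILED"
  message(name, ": ", results[name])
}
```

The first step reported as FAILED is the one to look at. tryCatch is used so the loop keeps going instead of stopping at the first error.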
Best of luck!
Upvotes: 13