Caleb

Reputation: 11

textProcessor changes the number of observations in my corpus (using the stm package in R)

I'm working with a dataset that has 439 observations for text analysis in stm. When I use textProcessor, the number of observations drops to 438 for some reason. This creates problems later on, for example when using the findThoughts() function.

##############################################
# PREPROCESSING
##############################################

# Process the data for analysis.
temp <- textProcessor(sovereigncredit$Content, sovereigncredit,
                      customstopwords = customstop, stem = FALSE)
meta  <- temp$meta
vocab <- temp$vocab
docs  <- temp$documents

length(docs)                     # QUESTION: why is this 438 instead of 439, like the original dataset?
length(sovereigncredit$Content)  # See, the original is 439.

out   <- prepDocuments(docs, vocab, meta)
docs  <- out$documents
vocab <- out$vocab
meta  <- out$meta

An example of this becoming a problem down the line is:

thoughts1 <- findThoughts(sovereigncredit1, texts = sovereigncredit$Content, n = 5, topics = 1)

For which the output is:

"Error in findThoughts(sovereigncredit1, texts = sovereigncredit$Content, : Number of provided texts and number of documents modeled do not match"

Here "sovereigncredit1" is a topic model fit on "out" from above.

If my interpretation is correct (and I'm not making another mistake), the problem seems to be this one-observation difference between the counts before and after textProcessor.

So far, I've checked the original csv and confirmed that there are in fact 439 valid observations and no empty rows. I'm not sure what's going on. Any help would be appreciated!

Upvotes: 1

Views: 935

Answers (1)

bstewart

Reputation: 508

stm can't handle empty documents, so we simply drop them. textProcessor removes a lot from texts: custom stopwords, words shorter than 3 characters, numbers, etc. So what's happening here is that one of your documents (whichever one is dropped) loses all of its contents at some point during the various steps textProcessor performs.
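A minimal sketch of this behavior (a toy corpus, not the asker's data): a document consisting only of stopwords, numbers, and short words ends up empty after preprocessing and is dropped. textProcessor reports the indices of dropped documents in its `docs.removed` element.

```r
library(stm)

# The second "document" contains only stopwords, numbers,
# and words shorter than 3 characters, so preprocessing
# empties it and textProcessor drops it.
toy <- c("sovereign credit ratings affect borrowing costs",
         "the a an 12 34 of to")

processed <- textProcessor(toy)

length(processed$documents)  # 1, not 2
processed$docs.removed       # index of the dropped document
```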

You can work out which document it was and decide what you want to do about it in this instance. In general, if you want more control over text manipulation, I would strongly recommend the quanteda package, which has much more fine-grained tools than stm for turning texts into a document-term matrix.
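Applied to the asker's code, tracing the dropped document and keeping findThoughts() aligned might look like the following sketch. It assumes `temp` and `out` from the question, and that `Content` is a column of the metadata passed to textProcessor (so the surviving rows travel along in `temp$meta` and `out$meta`).

```r
# Index of the document textProcessor dropped,
# relative to the original 439 rows:
temp$docs.removed

# Inspect the offending text to see why it emptied out:
sovereigncredit$Content[temp$docs.removed]

# prepDocuments can drop further documents; its removals
# are reported the same way:
out$docs.removed

# For findThoughts, pass texts that stay aligned with the
# modeled documents, i.e. from the processed metadata rather
# than the original data frame:
thoughts1 <- findThoughts(sovereigncredit1, texts = out$meta$Content,
                          n = 5, topics = 1)
```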

Upvotes: 1
