Reputation: 65
I am trying to run logistic regression on text data in R. I have built a term document matrix and a corresponding latent semantic space. As I understand it, LSA derives 'concepts' from 'terms', which helps with dimension reduction. Here's my code:
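library(tm)   # TermDocumentMatrix(), removeSparseTerms()
library(lsa)  # lsa(), dimcalc_share()
# myngramtoken and myweight are defined elsewhere (custom tokenizer and weighting)
tdm = TermDocumentMatrix(corpus, control = list(tokenize = myngramtoken, weighting = myweight))
tdm = removeSparseTerms(tdm, 0.98)
tdm = as.matrix(tdm)
tdm.lsa = lsa(tdm, dimcalc_share())
tdm.lsa_tk = as.data.frame(tdm.lsa$tk)
tdm.lsa_dk = as.data.frame(tdm.lsa$dk)
tdm.lsa_sk = as.data.frame(tdm.lsa$sk)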
This gives features named V1, V2, V3, ..., V21. Is it possible to use these as the independent variables in my logistic regression? If so, how can I do it?
Upvotes: 0
Views: 339
Reputation: 65
In the example above, the table tdm.lsa_dk has one row per document and one column per 'concept'. It can be used as the new training and testing data set for further analysis, in this case logistic regression. The dependent (response) variable from the original dataset has to be added to the new dataset, with its rows in the same order as the documents in the corpus. The table tdm.lsa_sk can help with variable selection: it holds the singular values of the 'concepts' in decreasing order, so the leading columns of tdm.lsa_dk carry the most weight.
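As a rough illustration of reading the singular values, here is a minimal sketch in base R. The values in `sk` are made up and only stand in for `tdm.lsa$sk`; the 80% threshold is also an arbitrary assumption. Note that "importance" here means how much of the text's variation a concept captures, not how well it predicts the outcome.

```r
# Hypothetical singular values, standing in for tdm.lsa$sk
sk <- c(9.1, 5.4, 3.2, 1.7, 0.6)

share     <- sk / sum(sk)   # relative weight of each concept
cum_share <- cumsum(share)  # cumulative weight, most important concept first

# Smallest number of leading concepts whose cumulative share reaches 80%
k <- which(cum_share >= 0.8)[1]

# Matching column names in tdm.lsa_dk (V1, V2, ...)
keep <- paste0("V", seq_len(k))
```

The columns named in `keep` could then be selected from tdm.lsa_dk before fitting the model.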
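library(caret)  # provides createDataPartition()
# the $dk part of the lsa will behave as your new dataset
new.dataset <- tdm.lsa_dk
# attach the response; rows must be in the same order as the documents in the corpus
new.dataset$y.var <- original.dataset$y.var
# creating training and testing dataset out of the new dataset (20% test, 80% train)
test_index <- createDataPartition(new.dataset$y.var, p = .2, list = FALSE)
Test <- new.dataset[test_index, ]
Train <- new.dataset[-test_index, ]
# create model on the concept columns and predict probabilities for the test documents
model <- glm(y.var ~ ., data = Train, family = "binomial")
prediction <- predict(model, Test, type = "response")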
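For anyone who wants to try the idea without a corpus, the following is a self-contained sketch in base R only. It rests on several assumptions: the term-document matrix is simulated, `svd()` stands in for `lsa()` (the document-side matrix `v` plays the role of `$dk`), the number of concepts `k` is fixed by hand, and `sample()` replaces `createDataPartition()`, so the split is not stratified.

```r
set.seed(1)

# Simulated data: 6 terms (rows) x 40 documents (columns), binary outcome y
n_docs <- 40
y      <- rep(c(0, 1), each = n_docs / 2)
lambda <- ifelse(y == 1, 2, 1)  # class-1 documents use the first three terms more often
signal <- t(sapply(1:3, function(i) rpois(n_docs, lambda)))
noise  <- matrix(rpois(3 * n_docs, 2), nrow = 3)
tdm    <- rbind(signal, noise)

# Truncated SVD: the first k columns of v are the document coordinates,
# analogous to the $dk component used above
s  <- svd(tdm)
k  <- 2  # chosen by hand for this sketch
dk <- as.data.frame(s$v[, 1:k, drop = FALSE])  # columns V1, V2

# Attach the response; rows of dk follow the column order of tdm
dk$y.var <- y

# Simple random hold-out of 8 documents (20%)
test_index <- sample(n_docs, size = 8)
Test  <- dk[test_index, ]
Train <- dk[-test_index, ]

# Logistic regression on the concept columns
model      <- glm(y.var ~ ., data = Train, family = binomial)
prediction <- predict(model, newdata = Test, type = "response")

# Turn probabilities into class labels with a 0.5 cutoff
predicted_class <- as.integer(prediction > 0.5)
```

Comparing `predicted_class` with `Test$y.var`, for example with `table()`, gives a quick view of how the concept features perform on the held-out documents.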
Upvotes: 0