rjg

Reputation: 65

How to incorporate features from a latent semantic analysis as independent variables in a predictive model

I am trying to run a logistic regression on text data in R. I have built a term-document matrix and a corresponding latent semantic space. As I understand it, LSA derives 'concepts' from 'terms', which helps with dimension reduction. Here's my code:

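    library(tm)   # TermDocumentMatrix(), removeSparseTerms()
    library(lsa)  # lsa(), dimcalc_share()

    # myngramtoken and myweight are my own tokenizer and weighting function
    tdm = TermDocumentMatrix(corpus, control = list(tokenize = myngramtoken, weighting = myweight))
    tdm = removeSparseTerms(tdm, 0.98)
    tdm = as.matrix(tdm)
    tdm.lsa = lsa(tdm, dimcalc_share())
    tdm.lsa_tk = as.data.frame(tdm.lsa$tk)
    tdm.lsa_dk = as.data.frame(tdm.lsa$dk)
    tdm.lsa_sk = as.data.frame(tdm.lsa$sk)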

This gives features named V1, V2, V3, ..., V21. Is it possible to use these as the independent variables in my logistic regression? If so, how can I do it?

Upvotes: 0

Views: 339

Answers (1)

rjg

Reputation: 65

In the example above, tdm.lsa_dk is a matrix with the documents as rows and the 'concepts' (latent dimensions) as columns. It can serve as the new dataset for training and testing in further analysis, in this case logistic regression. The dependent variable from the original dataset has to be added to the new dataset. The table tdm.lsa_sk can be used for variable selection: it holds the singular values, which rank the 'concept' variables in decreasing order of importance.
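As a rough illustration of how the singular values could guide variable selection, here is a minimal sketch on a made-up matrix. It uses base R's `svd()` as a stand-in for `lsa()`, and the names `toy.tdm`, `share`, `cum.share`, `k` and `keep`, as well as the 0.8 cutoff, are assumptions for the example rather than anything from the question.

```r
# Toy term-document matrix (terms in rows, documents in columns); values are made up.
set.seed(1)
toy.tdm <- matrix(rpois(60, lambda = 2), nrow = 10, ncol = 6)

# Base-R SVD stands in for lsa(); $d holds the singular values, like tdm.lsa$sk.
sk <- svd(toy.tdm)$d

# Share of the summed singular values carried by each concept, and the running total.
share <- sk / sum(sk)
cum.share <- cumsum(share)

# Keep the smallest number of leading concepts whose cumulative share reaches 80%
# (the 0.8 threshold is an arbitrary choice for illustration).
k <- which(cum.share >= 0.8)[1]
keep <- seq_len(k)
```

With the real objects, the same idea would presumably be applied to `tdm.lsa$sk`, and the retained predictors would be the columns `tdm.lsa_dk[, keep]`.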

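    library(caret)  # provides createDataPartition()

    # the $dk part of the LSA space serves as the new dataset
    # (its rows follow the document order of the term-document matrix)
    new.dataset <- tdm.lsa_dk
    new.dataset$y.var <- original.dataset$y.var

    # create training and testing datasets out of the new dataset (20% held out for testing)
    test_index <- createDataPartition(new.dataset$y.var, p = .2, list = FALSE)
    Test <- new.dataset[test_index, ]
    Train <- new.dataset[-test_index, ]

    # fit the logistic regression on all concept columns and predict test-set probabilities
    model <- glm(y.var ~ ., data = Train, family = "binomial")
    prediction <- predict(model, Test, type = "response")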
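Since `predict(..., type = "response")` returns probabilities, a further step is needed to get class labels. The sketch below shows one way to do that on simulated data, using only base R. The data frame `sim`, the plain `sample()` split (used here in place of `createDataPartition()`), and the 0.5 cutoff are assumptions made for the example.

```r
set.seed(42)
n <- 200

# Simulated stand-in for two concept columns and a 0/1 outcome
sim <- data.frame(V1 = rnorm(n), V2 = rnorm(n))
sim$y.var <- rbinom(n, 1, plogis(1.5 * sim$V1 - sim$V2))

# Hold out 20% of the rows for testing
test_index <- sample(n, size = 0.2 * n)
Test <- sim[test_index, ]
Train <- sim[-test_index, ]

# Fit the logistic regression and get predicted probabilities for the test rows
model <- glm(y.var ~ ., data = Train, family = "binomial")
prob <- predict(model, Test, type = "response")

# Turn probabilities into 0/1 labels with a 0.5 cutoff, then compare with the truth
pred.class <- ifelse(prob > 0.5, 1, 0)
conf.mat <- table(predicted = pred.class, actual = Test$y.var)
accuracy <- mean(pred.class == Test$y.var)
```

The cutoff does not have to be 0.5; with unbalanced classes a different threshold may suit the problem better.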

Upvotes: 0
