Reputation: 704
What is the syntax in text2vec to vectorize texts and build a DTM with only a given list of words?
How can I vectorize texts and produce a document-term matrix using only specified features? If a feature does not appear in a text, its column should still be present and stay empty (all zeros).
I need to produce document-term matrices with exactly the same columns as the DTM I trained the model on; otherwise I cannot apply the random forest model to new documents.
Upvotes: 0
Views: 456
Reputation: 14902
I need to produce document-term matrices with exactly the same columns as the DTM I trained the model on; otherwise I cannot apply the random forest model to new documents.
In quanteda you can set the features of a test set to be identical to those of a training set using dfm_select(). For example, to make dfm1 below have identical features to dfm2:
txts <- c("a b c d", "a a b b", "b c c d e f")
(dfm1 <- dfm(txts[1:2]))
## Document-feature matrix of: 2 documents, 4 features (25% sparse).
## 2 x 4 sparse Matrix of class "dfmSparse"
## features
## docs a b c d
## text1 1 1 1 1
## text2 2 2 0 0
(dfm2 <- dfm(txts[2:3]))
## Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
## 2 x 6 sparse Matrix of class "dfmSparse"
## features
## docs a b c d e f
## text1 2 2 0 0 0 0
## text2 0 1 2 1 1 1
dfm_select(dfm1, dfm2, valuetype = "fixed", verbose = TRUE)
## kept 4 features, padded 2 features
## Document-feature matrix of: 2 documents, 6 features (50% sparse).
## 2 x 6 sparse Matrix of class "dfmSparse"
## features
## docs a b c d e f
## text1 1 1 1 1 0 0
## text2 2 2 0 0 0 0
For feature-context matrices (what text2vec needs as input), however, this will not work, because the co-occurrences (at least those computed with a window rather than a document context) are interdependent across features, so you cannot simply add and remove features in the same way.
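A note for readers on a recent quanteda release: as far as I know, passing a dfm as the pattern to dfm_select() has since been superseded by dfm_match(), and dfm() now expects a tokens object rather than raw text. The sketch below repeats the padding step under those assumptions; the object names dfm_train, dfm_new and dfm_new_matched are made up for illustration.

```r
library(quanteda)

txts <- c("a b c d", "a a b b", "b c c d e f")

# Treat documents 2-3 as the "training" set and documents 1-2 as "new" data
# (an arbitrary split, mirroring dfm2 and dfm1 above).
dfm_train <- dfm(tokens(txts[2:3]))
dfm_new   <- dfm(tokens(txts[1:2]))

# Pad and reorder the new dfm so that its features match the training dfm:
# features missing from dfm_new are added as all-zero columns, and any
# features not present in dfm_train would be dropped.
dfm_new_matched <- dfm_match(dfm_new, features = featnames(dfm_train))

# Check that the feature sets now line up exactly.
identical(featnames(dfm_new_matched), featnames(dfm_train))
```

The matched dfm can then be passed to a model that was fitted on dfm_train, since the columns appear in the same order.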
Upvotes: 2
Reputation: 4595
You can create a document-term matrix from a specific set of features only:
library(text2vec)
# `it` is assumed to be an itoken() iterator over the new documents
v = create_vocabulary(c("word1", "word2"))
vectorizer = vocab_vectorizer(v)
dtm_test = create_dtm(it, vectorizer)
However, I don't recommend 1) using a random forest on such sparse data, as it won't work well, or 2) performing feature selection the way you described, as you will likely overfit.
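To make the snippet above concrete, here is a minimal end-to-end sketch, assuming the vocabulary is taken from the training documents rather than typed by hand. The documents and the names train_docs, new_docs, it_train and it_new are hypothetical; the key point is that the same vectorizer is reused for both matrices.

```r
library(text2vec)

# Hypothetical toy data: training documents and new documents to score.
train_docs = c("a b c d", "a a b b")
new_docs   = c("b c c d e f", "e f g")

# Build the vocabulary and vectorizer from the training documents only.
it_train   = itoken(train_docs, tokenizer = word_tokenizer, progressbar = FALSE)
vocab      = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train  = create_dtm(it_train, vectorizer)

# Reuse the SAME vectorizer for the new documents: terms outside the
# training vocabulary are ignored, and vocabulary terms that do not occur
# in the new documents remain as all-zero columns.
it_new  = itoken(new_docs, tokenizer = word_tokenizer, progressbar = FALSE)
dtm_new = create_dtm(it_new, vectorizer)

# Both matrices should now have the same columns in the same order.
identical(colnames(dtm_train), colnames(dtm_new))
```

Because the column set is fixed by the vectorizer, dtm_new can be fed to a model fitted on dtm_train without any further alignment.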
Upvotes: 2