Reputation: 704
I ran a random forest on an n-gram matrix of articles in order to classify them into two categories. As a result, the RF gave me a list of important variables.
Now I would like to run the random forest on only the top n selected features, and then use the same features when classifying new documents. For that I need to create a dfm containing only the most important variables (from the RF). How can I create a dictionary from a list of those important variables?
The relevant part of the code is below. After creating the dictionary, I have only one entry in it. How do I create it properly?
forestModel <-
  randomForest(x = as.matrix(myStemMat), y = as.factor(classVect),
               ntree = 1000)
impVariables <-
  data.frame(important = as.matrix(importance(forestModel)))
impVariables <-
  impVariables %>% mutate(impVar = row.names(impVariables)) %>%
  arrange(desc(MeanDecreaseGini)) %>%
  top_n(1000, wt = MeanDecreaseGini) %>%
  select(impVar) %>% as.list() %>% dictionary()
myStemMat <-
  dfm(
    mycorpus,
    dictionary = impVariables,
    # remove = stopwordsPL,
    stem = TRUE,
    remove_punct = TRUE,
    ngrams = c(1, 2)
  )
In brief: given a list of strings (words and n-grams), how can I create a dictionary that I can pass to the dfm() function to generate a term matrix?
Here is a link to the complete code (a reproducible example) and the data it uses: https://www.dropbox.com/s/3oe1tcfcauer0wf/text_data.zip?dl=0
Upvotes: 1
Views: 312
Reputation: 14902
You should read ?dictionary carefully, since a dictionary is not designed to be a tool for feature selection (although it can be used that way), but rather to create equivalence classes among the values assigned to dictionary keys.
If your impVariables is a character vector of features, then you should be able to use these commands to perform the selection you want:
toks <-
  tokens(mycorpus, remove_punct = TRUE) %>%
  tokens_select(impVariables, padding = TRUE) %>%
  tokens_wordstem() %>%
  tokens_ngrams(n = 1:2)
dfm(toks)
where the last command produces a document-feature matrix of just the stemmed n-gram features that were among the top features from your random forest model. Note that padding = TRUE prevents n-grams from being formed out of tokens that were never adjacent in your original text. If you don't care about that, set it to FALSE (the default).
ADDED:
To select the columns of the dfm from a character vector of selection words, here are two methods we can use.
We will work with these sample objects:
# two sample texts and their dfm representations
txt1 <- c(d1 = "a b c f g h",
          d2 = "a a c c d f f f")
txt2 <- c(d1 = "c c d f g h",
          d2 = "b b d i j")
(dfm1 <- dfm(txt1))
# Document-feature matrix of: 2 documents, 7 features (28.6% sparse).
# 2 x 7 sparse Matrix of class "dfmSparse"
#     features
# docs a b c f g h d
#   d1 1 1 1 1 1 1 0
#   d2 2 0 2 3 0 0 1
(dfm2 <- dfm(txt2))
# Document-feature matrix of: 2 documents, 8 features (43.8% sparse).
# 2 x 8 sparse Matrix of class "dfmSparse"
#     features
# docs c d f g h b i j
#   d1 2 1 1 1 1 0 0 0
#   d2 0 1 0 0 0 2 1 1
impVariables <- c("a", "c", "e", "z")
First Method: Create a dfm and select on that using dfm_select()
Here, we create a dfm from the character vector of your features, just so that they are registered as features, because of the way dfm_select() works when the selection object is a dfm.
impVariablesDfm <- dfm(paste(impVariables, collapse = " "))
dfm_select(dfm1, impVariablesDfm)
# Document-feature matrix of: 2 documents, 4 features (50% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
#     features
# docs a c e z
#   d1 1 1 0 0
#   d2 2 2 0 0
dfm_select(dfm2, impVariablesDfm)
# Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
#     features
# docs a c e z
#   d1 0 2 0 0
#   d2 0 0 0 0
Second Method: Create a dictionary and select on that using dfm_lookup()
Let's create a helper function to create a dictionary from a character vector:
# make a dictionary where each key = its value
char2dictionary <- function(x) {
  result <- as.list(x)  # make the vector into a list
  names(result) <- x
  dictionary(result)
}
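As an aside, the single-entry dictionary in the question most likely comes from calling as.list() on a one-column data frame rather than on a character vector. The base R sketch below illustrates the difference; impTable is a hypothetical stand-in for the importance table in the question, not the asker's actual object.

```r
# Hypothetical stand-in for the importance table built in the question
impTable <- data.frame(impVar = c("a", "c", "e", "z"),
                       stringsAsFactors = FALSE)

# as.list() on a data frame returns one element per COLUMN,
# so a one-column data frame becomes a list with a single key
perColumn <- as.list(impTable)
length(perColumn)
names(perColumn)

# as.list() on the character vector returns one element per VALUE,
# which is what the helper above relies on
perFeature <- as.list(impTable$impVar)
names(perFeature) <- impTable$impVar
length(perFeature)
```

Passing the first list to dictionary() would give one key holding all the features as its values, whereas the second gives one key per feature.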
Now, using dfm_lookup(), we get only the keys, even ones that were not observed:
dfm_lookup(dfm1, dictionary = char2dictionary(impVariables))
# Document-feature matrix of: 2 documents, 4 features (50% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
#     features
# docs a c e z
#   d1 1 1 0 0
#   d2 2 2 0 0
dfm_lookup(dfm2, dictionary = char2dictionary(impVariables))
# Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
#     features
# docs a c e z
#   d1 0 2 0 0
#   d2 0 0 0 0
Note: the output above was produced with the quanteda version shown below (but the first method at least will also work with v0.9.9.65):
packageVersion("quanteda")
# [1] ‘0.9.9.85’
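If you are on a much newer quanteda release, the calls above may not behave the same way. As far as I know, recent versions expect you to tokenize before calling dfm(), and provide dfm_match() for conforming a dfm to a fixed feature set. The following is only a sketch under that assumption (a recent quanteda installed), reusing the sample objects from above:

```r
library(quanteda)

txt1 <- c(d1 = "a b c f g h",
          d2 = "a a c c d f f f")
impVariables <- c("a", "c", "e", "z")

# tokenize first, then build the document-feature matrix
dfm1 <- dfm(tokens(txt1))

# keep exactly the features named in impVariables;
# features never observed in the texts are added as all-zero columns
sel1 <- dfm_match(dfm1, features = impVariables)
featnames(sel1)
```

The appeal of this route is that a dfm built from new documents can be matched to the same impVariables vector, so the training and prediction matrices end up with identical columns.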
Upvotes: 2