Reputation: 11
I often use the following code to find the top n features in a text:
top_n_terms <- text %>%
tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
tokens_ngrams(n = 1:3) %>%
dfm() %>%
dfm_remove(stopwords("en")) %>%
topfeatures(n = 3000)
Now I want to select the top 10% of features by passing a percentage (xx%) instead of a fixed count (n = xx). How can I adjust the code?
Thanks a lot for your help.
Upvotes: 0
Views: 179
Reputation: 23598
You can use the function nfeat() to get the number of features in the dfm. Multiply this by 0.1 and the 10% cut-off is computed automatically. Alternatively, define a variable before this code that holds the required percentage.
See the example below using data_corpus_inaugural. With your settings the dfm has 190261 features, and with the 10% selection the code returns 19026 of them.
top_n_terms <- data_corpus_inaugural %>%
tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
tokens_ngrams(n = 1:3) %>%
dfm() %>%
dfm_remove(stopwords("en")) %>%
topfeatures(n = nfeat(.) * 0.1) # the . in nfeat(.) is the dfm passed in by the pipe; replace 0.1 with a variable if needed
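If you would rather keep the percentage in a variable, here is a minimal sketch of the same idea without the pipe. The names top_share, toks, dfmat and n_top are only illustrative and not part of quanteda, and floor() is my addition to make the rounding explicit.

```r
library(quanteda)

# Share of features to keep (hypothetical variable name; 0.1 means the top 10%)
top_share <- 0.1

# Same preprocessing as in the question, written step by step
toks <- tokens(data_corpus_inaugural,
               remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE)
toks <- tokens_ngrams(toks, n = 1:3)
dfmat <- dfm_remove(dfm(toks), stopwords("en"))

# Turn the share into a whole number of features, rounding down
n_top <- floor(nfeat(dfmat) * top_share)

# Named numeric vector of the n_top most frequent features
top_n_terms <- topfeatures(dfmat, n = n_top)
length(top_n_terms)
```

Storing the dfm in its own object also means you can reuse it for several percentages without rebuilding the tokens each time.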
Upvotes: 1