NHH

Reputation: 11

quanteda - find most frequent terms by percentages

I often use the following code to find the top-n features in a text:

    top_n_terms <- text %>% 
      tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>% 
      tokens_ngrams(n = 1:3) %>% 
      dfm() %>% 
      dfm_remove(stopwords("en")) %>% 
      topfeatures(n = 3000)

Now I want to find the top 10% of features instead, i.e. pass a percentage (xx%) as input rather than n = xx. How should I adjust the code?

Thank you a lot for your help.

Upvotes: 0

Views: 179

Answers (1)

phiver

Reputation: 23598

You can use the function nfeat() to get the number of features in the dfm. Multiplying that number by 0.1 gives the feature count for the top 10% automatically. Alternatively, define a variable before this code that holds the desired percentage.

See the example below using data_corpus_inaugural. With your selections the dfm should have 190261 features; with the 10% selection, the code returns 19026 features.

    top_n_terms <- data_corpus_inaugural %>% 
      tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>% 
      tokens_ngrams(n = 1:3) %>% 
      dfm() %>% 
      dfm_remove(stopwords("en")) %>% 
      topfeatures(n = nfeat(.) * 0.1) # note the . in nfeat(); replace 0.1 with a variable if needed
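
The variable-based variant mentioned above could look roughly like the sketch below. The names top_share, dfmat, and n_top are hypothetical, chosen only for illustration. Storing the dfm first and rounding with ceiling() are my own additions, so that n is always a whole number; the one-liner above does not do this.

```r
library(quanteda)
library(magrittr)  # loaded explicitly for %>%

# Hypothetical variable holding the share of features to keep (10%).
top_share <- 0.10

# Build the dfm once so its feature count can be reused.
dfmat <- data_corpus_inaugural %>%
  tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_ngrams(n = 1:3) %>%
  dfm() %>%
  dfm_remove(stopwords("en"))

# Convert the share into a whole number of features (rounding up is an assumption).
n_top <- ceiling(nfeat(dfmat) * top_share)

# Named numeric vector of the n_top most frequent features.
top_terms <- topfeatures(dfmat, n = n_top)
length(top_terms)
```

Keeping the dfm in its own object also makes it easy to try several percentages without re-tokenizing the corpus.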

Upvotes: 1
