Hardik Gupta

Reputation: 4790

text2vec: Iterate over the vocabulary after using function create_vocabulary

Using the text2vec package, I created a vocabulary:

vocab = create_vocabulary(it_0, ngram = c(2L, 2L)) 

vocab looks something like this:

> vocab
Number of docs: 120 
0 stopwords:  ... 
ngram_min = 2; ngram_max = 2 
Vocabulary: 
                    terms terms_counts doc_counts
    1:    knight_severely            1          1
    2:       movie_expect            1          1
    3: recommend_watching            1          1
    4:        nuke_entire            1          1
    5:      sense_keeping            1          1
   ---                                           
14467:         stand_idly            1          1
14468:    officer_loyalty            1          1
14469:    willingness_die            1          1
14470:         fight_bane            3          3
14471:     bane_beginning            1          1

How can I check the range of the terms_counts column? Knowing it would help me choose a threshold for pruning, which is my next step:

pruned_vocab = prune_vocabulary(vocab, term_count_min = <BLANK>)

The code below is a reproducible example:

library(text2vec)

text <- c(" huge fan superhero movies expectations batman begins viewing christopher 
          nolan production pleasantly shocked huge expectations dark knight christopher 
          nolan blew expectations dust happen film dark knight rises simply big expectations 
          blown production true cinematic experience behold movie exceeded expectations terms 
          action entertainment",                                                       
          "christopher nolan outdone morning tired awake set film films genuine emotional 
          eartbeat felt flaw nolan films vision emotion hollow bought felt hero villain 
          alike christian bale typically brilliant batman felt bruce wayne heavily embraced
          final installment bale added emotional depth character plot point astray dark knight")

it_0 = itoken(text,
              tokenizer = word_tokenizer,
              progressbar = TRUE)

vocab = create_vocabulary(it_0, ngram = c(2L, 2L)) 
vocab

Upvotes: 1

Views: 320

Answers (2)

Dmitriy Selivanov

Reputation: 4595

vocab is a list holding some meta-information (number of documents, ngram size, etc.) plus the main data.frame/data.table with the count of each term and the number of documents each term appears in.

As already mentioned, vocab$vocab is what you need (a data.table with the counts).

You can inspect the internal structure by calling str(vocab):

List of 5
 $ vocab         :Classes ‘data.table’ and 'data.frame':    82 obs. of  3 variables:
  ..$ terms       : chr [1:82] "plot_point" "depth_character" "emotional_depth" "bale_added" ...
  ..$ terms_counts: int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ doc_counts  : int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ ngram         : Named int [1:2] 2 2
  ..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
 $ document_count: int 2
 $ stopwords     : chr(0) 
 $ sep_ngram     : chr "_"
 - attr(*, "class")= chr "text2vec_vocabulary"

Upvotes: 1

Imran Ali

Reputation: 2279

Try range(vocab$vocab$terms_counts)
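
To pick a value for term_count_min, it can also help to look at how the counts are distributed, not only at their range. The following is a minimal sketch in base R. It uses a small made-up stand-in list shaped like the str(vocab) output shown in the other answer, so it runs without text2vec installed; the terms, the counts, and the min_count value are illustrative assumptions, not results from the question's data. With a real vocabulary you would skip the stand-in and use the object returned by create_vocabulary(). Column names may differ between text2vec versions, so check str(vocab) first.

```r
# Stand-in for the object returned by create_vocabulary(). The layout mirrors
# the str(vocab) output above; the values are made up for illustration.
vocab <- list(
  vocab = data.frame(
    terms        = c("dark_knight", "christopher_nolan", "plot_point", "bale_added"),
    terms_counts = c(3L, 2L, 1L, 1L),
    doc_counts   = c(2L, 2L, 1L, 1L),
    stringsAsFactors = FALSE
  ),
  document_count = 2L
)

counts <- vocab$vocab$terms_counts

count_range <- range(counts)  # smallest and largest term count
count_table <- table(counts)  # how many terms occur once, twice, ...

print(count_range)
print(count_table)

# Number of terms that would survive a candidate threshold (hypothetical value)
min_count <- 2L
n_kept <- sum(counts >= min_count)
print(n_kept)
```

If n_kept looks reasonable for a candidate threshold, the same value can then be passed as term_count_min to prune_vocabulary().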

Upvotes: 1
