Reputation: 959
I just want to know how to identify or get a list of the words, along with their frequencies, that are considered for the bag of words by the Keras tokenizer. Consider the example below:
from tensorflow.keras.preprocessing import text
my_list = [["a", "a", "a", "b", "c"], ["b", "c", "c", "b", "c", "a"]]
Here I am selecting a vocab size of 2. One slot will be used for padding and the other by the word with the highest frequency in my_list.
m_tokenizer = text.Tokenizer(num_words=2)
m_tokenizer.fit_on_texts(my_list)
# bag of words using the tokenizer
bow = m_tokenizer.texts_to_matrix(my_list)
bow
array([[0., 1.],
       [0., 1.]])
I can easily get a dict of all the words along with the indexing the tokenizer uses internally: m_tokenizer.word_index
{'a': 1, 'c': 2, 'b': 3}
Now I want to know: when I selected num_words=2, which words were used by the tokenizer, along with their frequency in the corpus, to build the bag of words? (Obviously the first slot is for padding.) For example, here "a" has the maximum frequency in my_list, so it is used to form the bow. Is there a method that can fetch me a dict (or something similar) that gives
{"a": 4}  # as the count of "a" in my_list is 4
Upvotes: 1
Views: 1944
Reputation: 111
You can use the count mode of the tokenizer to generate the required dict:
bow = m_tokenizer.texts_to_matrix(my_list, mode='count')
req_dict = {}
for key, value in m_tokenizer.word_index.items():
    if value < m_tokenizer.num_words:
        # sum the column over all documents to get the corpus-wide count
        req_dict[key] = int(bow[:, value].sum())
print(req_dict)  # {'a': 4}
Upvotes: 1
Reputation: 22031
You can access the counts of ALL the words found in the original texts using m_tokenizer.word_counts; it returns OrderedDict([('a', 4), ('b', 3), ('c', 4)]).
If you want to limit the result to the num_words you defined, you can do it like this:
# word_counts keeps insertion order, not frequency order, so filter
# through word_index, which IS sorted by descending frequency
for word, index in m_tokenizer.word_index.items():
    if index < m_tokenizer.num_words:
        print((word, m_tokenizer.word_counts[word]))  # print or store in an object
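For reference, the same restricted count dict can be reproduced without Keras at all. This is a minimal pure-Python sketch of what the tokenizer tracks internally (a counter for corpus frequencies and a frequency-sorted index), using the example lists from the question; the variable names `word_counts`, `word_index`, and `req_dict` mirror the snippets above but are plain local objects here:

```python
from collections import Counter, OrderedDict

my_list = [["a", "a", "a", "b", "c"], ["b", "c", "c", "b", "c", "a"]]
num_words = 2  # same limit as the Tokenizer above

# corpus-wide frequencies, analogous to m_tokenizer.word_counts
word_counts = Counter(word for doc in my_list for word in doc)

# frequency-sorted index starting at 1, analogous to m_tokenizer.word_index
# (Python's sort is stable, so ties keep first-occurrence order)
ranked = sorted(word_counts.items(), key=lambda kv: -kv[1])
word_index = OrderedDict((w, i + 1) for i, (w, _) in enumerate(ranked))

# keep only the words that fit under num_words (index 0 is reserved for padding)
req_dict = {w: word_counts[w] for w, i in word_index.items() if i < num_words}
print(req_dict)  # {'a': 4}
```

Since "a" and "c" both occur 4 times, the stable sort keeps "a" first because it appears first in the corpus, which matches the {'a': 1, 'c': 2, 'b': 3} ordering the tokenizer produced above.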
Upvotes: 2