Reputation: 959
I just want to know how to identify or get a list of the words, along with their frequencies, that are considered for the bag of words by the Keras tokenizer. Consider the example below:
from tensorflow.keras.preprocessing import text
my_list = [["a", "a", "a", "b", "c"], ["b", "c", "c", "b", "c", "a"]]
Here I am selecting a vocab size of 2. One slot will be used for padding and the other by the word with the highest frequency in my_list.
m_tokenizer = text.Tokenizer(num_words=2)
m_tokenizer.fit_on_texts(my_list)
# bag of words using the tokenizer
bow = m_tokenizer.texts_to_matrix(my_list)
bow
array([[0., 1.],
       [0., 1.]])
I can easily get a dict of all the words along with the indexing the tokenizer uses internally: m_tokenizer.word_index
{'a': 1, 'c': 2, 'b': 3}
Now I want to know: when I selected num_words=2, which words were used by the tokenizer, along with their frequency in the corpus, to build the bag of words? (Obviously the first slot is for padding.) For example, here "a" has the maximum frequency in my_list, so it is used to form the bow. Is there a method that can fetch me a dict (or something similar) that gives
{"a": 4}  # as the count of "a" in my_list is 4
Upvotes: 1
Views: 1944
Reputation: 111
You can use the count mode of the tokenizer to generate the required dict:
bow = m_tokenizer.texts_to_matrix(my_list, mode='count')
req_dict = {}
for key, value in m_tokenizer.word_index.items():
    if value < m_tokenizer.num_words:
        # sum the column over all documents to get the corpus-wide count
        req_dict[key] = int(bow[:, value].sum())
print(req_dict)  # {'a': 4}
Upvotes: 1
Reputation: 22031
You can access the counts of ALL the words found in the original texts using m_tokenizer.word_counts; it returns OrderedDict([('a', 4), ('b', 3), ('c', 4)]).
If you want to limit the result to the num_words you defined, you can do it like this:
# word_counts keeps insertion order, not frequency order, so filter
# through word_index, which IS sorted by descending frequency
for word, index in m_tokenizer.word_index.items():
    if index < m_tokenizer.num_words:
        print((word, m_tokenizer.word_counts[word]))  # print or store in an object
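For reference, the same restricted count dict can be reproduced without Keras at all. This is a minimal pure-Python sketch of what the tokenizer tracks internally (a counter for corpus frequencies and a frequency-sorted index), using the example lists from the question; the variable names `word_counts`, `word_index`, and `req_dict` mirror the snippets above but are plain local objects here:

```python
from collections import Counter, OrderedDict

my_list = [["a", "a", "a", "b", "c"], ["b", "c", "c", "b", "c", "a"]]
num_words = 2  # same limit as the Tokenizer above

# corpus-wide frequencies, analogous to m_tokenizer.word_counts
word_counts = Counter(word for doc in my_list for word in doc)

# frequency-sorted index starting at 1, analogous to m_tokenizer.word_index
# (Python's sort is stable, so ties keep first-occurrence order)
ranked = sorted(word_counts.items(), key=lambda kv: -kv[1])
word_index = OrderedDict((w, i + 1) for i, (w, _) in enumerate(ranked))

# keep only the words that fit under num_words (index 0 is reserved for padding)
req_dict = {w: word_counts[w] for w, i in word_index.items() if i < num_words}
print(req_dict)  # {'a': 4}
```

Since "a" and "c" both occur 4 times, the stable sort keeps "a" first because it appears first in the corpus, which matches the {'a': 1, 'c': 2, 'b': 3} ordering the tokenizer produced above.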
Upvotes: 2