Reputation: 1197
I am trying to train and build a tokenizer using Keras. Here is the snippet of code where I do this:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
txt1="""What makes this problem difficult is that the sequences can vary in length,
be comprised of a very large vocabulary of input symbols and may require the model
to learn the long term context or dependencies between symbols in the input sequence."""
#txt1 is used for fitting
tk = Tokenizer(nb_words=2000, lower=True, split=" ",char_level=False)
tk.fit_on_texts(txt1)
#convert text to sequences
t= tk.texts_to_sequences(txt1)
#padding to feed the sequence to keras model
t=pad_sequences(t, maxlen=10)
When I check which words the Tokenizer has learned, it turns out that it has only learned characters, not words.
print(tk.word_index)
output:
{'e': 1, 't': 2, 'n': 3, 'a': 4, 's': 5, 'o': 6, 'i': 7, 'r': 8, 'l': 9, 'h': 10, 'm': 11, 'c': 12, 'u': 13, 'b': 14, 'd': 15, 'y': 16, 'p': 17, 'f': 18, 'q': 19, 'v': 20, 'g': 21, 'w': 22, 'k': 23, 'x': 24}
Why does it not contain any words?
Furthermore, if I print t, it clearly shows that words are ignored and the text is tokenized character by character:
print(t)
Output:
[[ 0 0 0 ... 0 0 22]
[ 0 0 0 ... 0 0 10]
[ 0 0 0 ... 0 0 4]
...
[ 0 0 0 ... 0 0 12]
[ 0 0 0 ... 0 0 1]
[ 0 0 0 ... 0 0 0]]
Upvotes: 2
Views: 1675
Reputation: 1197
I figured out the error. If the text is passed as follows:
txt1=["""What makes this problem difficult is that the sequences can vary in length,
be comprised of a very large vocabulary of input symbols and may require the model
to learn the long term context or dependencies between symbols in the input sequence."""]
with the brackets, it works just fine. Here is the new output of t:
print(t)
[[30 31 32 33 34 5 2 1 4 35]]
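This means that fit_on_texts and texts_to_sequences take a list of texts rather than a single string. When a bare string is passed, it is iterated character by character, so every character is treated as its own text.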
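To illustrate why the brackets matter, here is a minimal sketch in plain Python. Note that build_word_index is a hypothetical, simplified stand-in written for this example; it is not the Keras implementation. It only assumes that the tokenizer loops over its argument and treats each item as one text.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Hypothetical, simplified stand-in for Tokenizer.fit_on_texts (not Keras code)."""
    counts = Counter()
    for text in texts:  # iterating over a bare string yields single characters
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # Most frequent tokens get the lowest indices, starting at 1
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentence = "the model reads the input sequence"

char_index = build_word_index(sentence)    # bare string: every character is a "text"
word_index = build_word_index([sentence])  # list: one text, split into words

print(char_index)
print(word_index)
```

With the bare string, every key in the index is a single character; with the list, the keys are whole words.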
Upvotes: 2
Reputation: 87
Try this:
from keras.preprocessing.text import Tokenizer
txt1 = ('What makes this problem difficult is that the sequences can vary in length, '
        'be comprised of a very large vocabulary of input symbols and may require the model '
        'to learn the long term context or dependencies between symbols in the input sequence.')
t = Tokenizer()
t.fit_on_texts([txt1])  # fit_on_texts expects a list of texts, not a bare string
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
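Copy and paste it, then run it.
I assumed the first problem was the triple quotation marks around the input text, although triple-quoted strings are valid Python; what matters is that the tokenizer methods receive a list of texts. Second, you don't have to call t = tk.texts_to_sequences(txt1);
instead, do this: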
encoded_txt = t.texts_to_matrix([txt1], mode='count')  # one row per text in the list
print(encoded_txt)
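For a rough idea of what the count mode produces, here is a small sketch in plain Python. The count_matrix helper and the tiny word_index are made up for illustration and are not part of Keras; the sketch assumes one row per text and one column per word index, with column 0 left unused because word indices start at 1.

```python
def count_matrix(texts, word_index):
    """Hypothetical helper mimicking the shape of a count-mode matrix (not Keras code)."""
    width = len(word_index) + 1  # column 0 stays unused because indices start at 1
    rows = []
    for text in texts:
        row = [0] * width
        for word in text.lower().split():
            if word in word_index:
                row[word_index[word]] += 1  # count each occurrence of a known word
        rows.append(row)
    return rows

# Made-up vocabulary for the example
word_index = {"the": 1, "model": 2, "input": 3}

matrix = count_matrix(["the model the input"], word_index)
print(matrix)
```

Each row describes one text, so passing a list with a single text gives a matrix with a single row.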
Another workaround is:
from keras.preprocessing.text import text_to_word_sequence
text = txt1
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
Upvotes: 0