Reputation: 4431
I have two problems.
First, the documentation for tf.keras.datasets.imdb.get_word_index says "Retrieves the dictionary mapping word indices back to words."
In fact it is the other way around: the dictionary maps words to indices.
print(tf.keras.datasets.imdb.get_word_index())
{'fawn': 34701, 'tsukino': 52006, 'nunnery': 52007
Second, I tried to run this in TensorFlow 2.0:
import tensorflow as tf

(train_data_raw, train_labels), (test_data_raw, test_labels) = tf.keras.datasets.imdb.load_data()
words2idx = tf.keras.datasets.imdb.get_word_index()
# Invert the mapping: index -> word
idx2words = {idx: word for word, idx in words2idx.items()}
# Decode the first training review
train_ex = [idx2words[x] for x in train_data_raw[0]]
train_ex = ' '.join(train_ex)
print(train_ex)
This results in a nonsense string:
the as you with out themselves powerful lets loves their [...]
Shouldn't I get a valid movie review?
Upvotes: 2
Views: 874
Reputation: 809
I did a bit of digging and found that load_data applies a couple of offsets that need to be undone to recover readable review text. By default it uses index_from=3, so every word index in the raw sequence is shifted up by 3 relative to the dictionary returned by get_word_index (indices 0, 1 and 2 are reserved for padding, the start marker and out-of-vocabulary words). In addition, the first element of each sequence is a dummy start marker (start_char=1), so the real text starts at position 2 (index 1 in Python). I modified your line to subtract 3 from each index and to skip the first element:
train_ex = [idx2words[x-3] for x in train_data_raw[0][1:]]
Using the above modification gives me the following for the review you originally selected:
this film was just brilliant casting location scenery story direction everyone's really suited the part they played ...
Some punctuation and capitalization appear to have been removed, but this returns sensible reviews.
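The same idea can be wrapped in a small helper. This is only a sketch: decode_review and toy_word_index are hypothetical names made up for illustration, and the tiny vocabulary stands in for the real dictionary so the snippet runs without downloading anything. It assumes the load_data defaults (start_char=1, oov_char=2, index_from=3); if you passed other values to load_data, pass the same ones here.

```python
def decode_review(encoded, word_index, start_char=1, oov_char=2, index_from=3):
    """Turn a list of integer ids back into a space-separated string.

    Assumes the ids were produced with the given start_char, oov_char and
    index_from (the values below mirror the load_data defaults).
    """
    # Invert word -> rank into (rank + index_from) -> word, so the offset
    # is applied once here instead of at every lookup.
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    # The out-of-vocabulary id gets a placeholder; "<unk>" is an arbitrary choice.
    id_to_word[oov_char] = "<unk>"
    # Skip padding (0) and the start marker; unknown ids fall back to "<unk>".
    words = [id_to_word.get(i, "<unk>") for i in encoded if i not in (0, start_char)]
    return " ".join(words)


# Hypothetical toy vocabulary in the same word -> rank format as get_word_index().
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
# Start marker (1), then each rank shifted up by 3.
toy_review = [1, 5, 6, 7]
decoded = decode_review(toy_review, toy_word_index)
print(decoded)
```

With the real data you would call it as decode_review(train_data_raw[0], tf.keras.datasets.imdb.get_word_index()).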
I hope this helps.
Upvotes: 4