robertspierre

Reputation: 4431

Cannot make sense of keras.datasets.imdb

I have two problems:

  1. First off, the documentation for tf.keras.datasets.imdb.get_word_index says

Retrieves the dictionary mapping word indices back to words.

In fact it is the opposite: the returned dictionary maps words to indices,

print(tf.keras.datasets.imdb.get_word_index())

{'fawn': 34701, 'tsukino': 52006, 'nunnery': 52007, ...}

  2. I tried to run this in TensorFlow 2.0:

import tensorflow as tf

(train_data_raw, train_labels), (test_data_raw, test_labels) = tf.keras.datasets.imdb.load_data()
words2idx = tf.keras.datasets.imdb.get_word_index()
idx2words = {idx: word for word, idx in words2idx.items()}
train_ex = [idx2words[x] for x in train_data_raw[0]]
train_ex = ' '.join(train_ex)
print(train_ex)

This results in a nonsense string:

the as you with out themselves powerful lets loves their [...]

Shouldn't I get a valid movie review?

Upvotes: 2

Views: 874

Answers (1)

ad2004

Reputation: 809

I did a bit of digging and found that load_data applies a couple of offsets that have to be undone to recover readable review text. By default, word indices in the encoded sequences are shifted up by 3 (index_from=3), so you need to subtract 3 from each token before looking it up in idx2words. The first token of every sequence is also a start marker (set to 1) rather than a word, so the real text begins at the second element (index 1 in Python). I modified your line accordingly:

train_ex = [idx2words[x-3] for x in train_data_raw[0][1:]]

Using the above modification gives me the following for the review you originally selected:

this film was just brilliant casting location scenery story direction everyone's really suited the part they played ...

Punctuation and capitalization appear to have been stripped, but otherwise this returns sensible reviews.
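To make the offsets explicit, here is a minimal sketch of a decoder. It assumes the load_data defaults (start_char=1, oov_char=2, index_from=3) and the usual convention that 0 is padding. The name decode_review, the placeholder strings, and the toy word index are my own, made up for illustration; they are not part of Keras.

```python
def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer tokens back into a space-separated string.

    Assumes the load_data() defaults: 0 = padding, 1 = start marker,
    2 = out-of-vocabulary, and real words shifted up by index_from.
    """
    # Invert the word -> rank mapping into rank -> word.
    idx2word = {idx: word for word, idx in word_index.items()}
    # Placeholder strings for the reserved tokens (hypothetical names).
    reserved = {0: "<pad>", 1: "<start>", 2: "<oov>"}
    words = []
    for token in encoded:
        if token in reserved:
            words.append(reserved[token])
        else:
            # Undo the shift; fall back to a marker if the rank is unknown.
            words.append(idx2word.get(token - index_from, "<unk>"))
    return " ".join(words)


# Toy word index, invented for illustration (not the real IMDB vocabulary).
toy_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
print(decode_review([1, 5, 6, 2, 7], toy_index))
# prints: <start> film was <oov> brilliant
```

With the real data you would call it as decode_review(train_data_raw[0], tf.keras.datasets.imdb.get_word_index()). Unlike the one-liner above, this keeps the start marker visible and does not raise a KeyError if a sequence contains the out-of-vocabulary token, which can happen when load_data is called with num_words set.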

I hope this helps.

Upvotes: 4
