Stat Tistician

Reputation: 883

Label tokenizer not working, loss and accuracy cannot be calculated

I am using Keras/TensorFlow for NLP and am currently working on the imdb_reviews dataset. I would like to make use of hub.KerasLayer and pass the actual x and y values directly, i.e. the sentences as x and the labels as y in my model.fit call. My code:

import csv
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

imdb_train=imdb['train']
imdb_test=imdb['test']

training_sentences=[]
training_labels=[]

test_sentences=[]
test_labels=[]

for a,b in imdb_train:
  training_sentences.append(a.numpy().decode("utf8"))
  training_labels.append(b.numpy())

for a,b in imdb_test:
  test_sentences.append(a.numpy().decode("utf8"))
  test_labels.append(b.numpy())

model = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(model, output_shape=[20], input_shape=[], 
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),optimizer='adam', metrics=[tf.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])

Trying

history = model.fit(x=training_sentences,
                      y=training_labels,
                      validation_data=(test_sentences, test_labels),
                      epochs=2)

doesn't work because training_labels is not in the correct shape/format. My approach now is to apply the tokenizer again, because texts_to_sequences then gives me the result in the correct format/shape. For this I first have to transform the labels into "yes"/"no" (or "a"/"b", whatever) strings.
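For reference, this is what training_labels actually contains at this point, namely a plain Python list of NumPy integer scalars:

import numpy as np

# The labels as collected in the loops above: a Python list of numpy.int64 scalars (0 or 1).
print(type(training_labels), type(training_labels[0]))  # <class 'list'> <class 'numpy.int64'>
print(np.array(training_labels).shape)                  # (25000,)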

training_labels_test=[]
for i in training_labels:
   if i==0: training_labels_test.append("no")
   if i==1: training_labels_test.append("yes")
  
testtokenizer=Tokenizer()
testtokenizer.fit_on_texts(training_labels_test)
test_labels_pad=testtokenizer.texts_to_sequences(training_labels_test)

val_labels_test=[]
for i in test_labels:
   if i==0: val_labels_test.append("no")
   if i==1: val_labels_test.append("yes")

testtokenizer.fit_on_texts(val_labels_test)
val_labels_pad=testtokenizer.texts_to_sequences(val_labels_test)
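To see what the tokenizer actually produced (the exact order of word_index can vary, since it is built from word frequencies):

print(testtokenizer.word_index)  # e.g. {'no': 1, 'yes': 2}
print(test_labels_pad[:5])       # e.g. [[1], [1], [2], [1], [2]]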

Because I now have 1 and 2 as labels, I need to update my model:

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(2))

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

I then try to fit it:

history = model.fit(x=training_sentences,
                      y=test_labels_pad,
                      validation_data=(test_sentences, val_labels_pad),
                      epochs=2)

The problem is that the loss is nan and the accuracy is not calculated correctly.

Where is the mistake?

Please note that my question is really about this specific approach and why the tokenizer is not working here. I am aware that there are other possibilities which would work.

Upvotes: 0

Views: 468

Answers (1)

Nicolas Gervais

Reputation: 36724

The problem seems to be two-fold.

First, binary targets for SparseCategoricalCrossentropy should be [0, 1], not [1, 2]. Tokenizer() isn't made to encode labels; you should use tfds.features.ClassLabel() for that. For now, I just subtracted 1 from your targets in the fit() call.

history = model.fit(x=training_sentences,
                      y=list(map(lambda x: x[0] - 1, test_labels_pad)),
                      validation_data=(test_sentences, 
                                       list(map(lambda x: x[0] - 1, val_labels_pad))),
                      epochs=1)
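If you want to go the ClassLabel route instead, a minimal sketch (reusing the "no"/"yes" lists built in the question) could look like this:

# Sketch: encode the "no"/"yes" strings directly to 0/1 with a ClassLabel feature;
# the order of names defines the integer ids, so no shifting is needed afterwards.
label_feature = tfds.features.ClassLabel(names=["no", "yes"])
train_ids = [label_feature.str2int(s) for s in training_labels_test]  # 0s and 1s
val_ids = [label_feature.str2int(s) for s in val_labels_test]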

Second, your hub embedding layer returned only nan for some reason. On the page of the pre-trained model, they say:

google/tf2-preview/gnews-swivel-20dim-with-oov/1 - same as google/tf2-preview/gnews-swivel-20dim/1, but with 2.5% vocabulary converted to OOV buckets. This can help if vocabulary of the task and vocabulary of the model don't fully overlap.

So you should use the second one (the -with-oov variant), since your dataset's vocabulary doesn't fully overlap with the vocabulary the model was trained on. With that change, your model starts learning.

model = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1"
hub_layer = hub.KerasLayer(model, output_shape=[20], input_shape=[],
                           dtype=tf.string, trainable=True)
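If you want to double-check that the embeddings are finite with the -with-oov module, something along these lines works (feeding a few of the raw review strings through the layer):

# Sanity check: none of the 20-dimensional embeddings should contain NaN.
sample_emb = hub_layer(tf.constant(training_sentences[:3]))
print(sample_emb.shape)                           # (3, 20)
print(tf.reduce_any(tf.math.is_nan(sample_emb)))  # tf.Tensor(False, shape=(), dtype=bool)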

Full running code:

import csv
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

imdb_train=imdb['train']
imdb_test=imdb['test']

training_sentences=[]
training_labels=[]

test_sentences=[]
test_labels=[]

for a,b in imdb_train:
  training_sentences.append(a.numpy().decode("utf8"))
  training_labels.append(b.numpy())

for a,b in imdb_test:
  test_sentences.append(a.numpy().decode("utf8"))
  test_labels.append(b.numpy())

training_labels_test = []
for i in training_labels:
    if i == 0: training_labels_test.append("no")
    if i == 1: training_labels_test.append("yes")

testtokenizer = Tokenizer()
testtokenizer.fit_on_texts(training_labels_test)
test_labels_pad = testtokenizer.texts_to_sequences(training_labels_test)

val_labels_test = []
for i in test_labels:
    if i == 0: val_labels_test.append("no")
    if i == 1: val_labels_test.append("yes")

testtokenizer.fit_on_texts(val_labels_test)
val_labels_pad = testtokenizer.texts_to_sequences(val_labels_test)

model = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1"
hub_layer = hub.KerasLayer(model, output_shape=[20], input_shape=[],
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(2))

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

history = model.fit(x=training_sentences,
                      y=list(map(lambda x: x[0] - 1, test_labels_pad)),
                      validation_data=(test_sentences, 
                      list(map(lambda x: x[0] - 1, val_labels_pad))),
                      epochs=1)

model.predict(training_sentences)
24896/25000 [==================>.] - ETA: 0s - loss: 0.5482 - sparse_cat_acc: 0.7312
array([[-0.94201976, -1.3173063 ],
       [-3.7894788 , -3.0269182 ],
       [-3.0404441 , -3.4826043 ],
       ...,
       [-2.8379505 , -1.2451388 ],
       [-0.7685702 , -3.1836908 ],
       [-1.7252465 , -3.8163807 ]], dtype=float32)

Look what happens if you have 3 categories, and use [1, 2, 3] instead of [0, 1, 2]:

y_true = tf.constant([1, 2, 3])
y_pred = tf.constant([[0.05, 0.95, 0], [0.1, 0.8, 0.1], [.2, .4, .4]])
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy()
nan

But it works with [0, 1, 2]:

y_true = tf.constant([0, 1, 2])
y_pred = tf.constant([[0.05, 0.95, 0], [0.1, 0.8, 0.1], [.2, .4, .4]])
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy()
1.3783889
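So the rule is that labels for SparseCategoricalCrossentropy must lie in [0, num_classes). If the tokenizer hands you 1-based ids, a plain NumPy shift (equivalent to the list(map(...)) calls above) is all you need:

import numpy as np

# Shift the tokenizer's 1-based ids down to the 0-based range the loss expects.
y_train = np.asarray(test_labels_pad).ravel() - 1  # values in {0, 1}
y_val = np.asarray(val_labels_pad).ravel() - 1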

Upvotes: 2
