MichaelJanz

Reputation: 1815

TensorFlow: create y-indices from class labels

I have class labels as:

y = ["class1", "class2", "class3"]

To use them in a model, I want to convert these classes to integer y_indices (such as 1, 2) using Keras and/or TensorFlow 2.0 methods.

What I am doing currently is:

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(y)
y_train = tokenizer.texts_to_sequences(y)

I know that the tokenizer is somewhat misused here. Is there a better, more compact way to convert class labels to indices? Thanks.

Upvotes: 2

Views: 1209

Answers (1)

Nicolas Gervais

Reputation: 36604

You shouldn't use a Tokenizer for this, because Tokenizer indexing starts at 1, not 0. You can use tf.where instead:

import tensorflow as tf

y = ['class3', 'class1', 'class1', 'class2', 'class3', 'class1', 'class2']

names = ["class1", "class2", "class3"]

labeler = lambda x: tf.where(tf.equal(x, names))

dataset = tf.data.Dataset.from_tensor_slices(y).map(labeler)

next(iter(dataset))
<tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[2]], dtype=int64)>
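
Note that tf.where yields a tensor of shape (1, 1) per element. If scalar labels are preferred, one possible variant is to take the argmax of the match mask. This is only a sketch and not part of the original answer: the helper name to_index is made up for illustration, and a label that is missing from names would silently map to 0 here.

```python
import tensorflow as tf

y = ['class3', 'class1', 'class1', 'class2', 'class3', 'class1', 'class2']
names = tf.constant(["class1", "class2", "class3"])

def to_index(label):
    # Compare the scalar label against every known name (boolean mask of shape (3,)).
    matches = tf.cast(tf.equal(label, names), tf.int32)
    # The position of the single 1 in the mask is the class index (scalar int64).
    return tf.argmax(matches)

dataset = tf.data.Dataset.from_tensor_slices(y).map(to_index)

# Collect the scalar labels into a plain Python list for inspection.
labels = [int(v) for v in dataset.as_numpy_iterator()]
print(labels)
```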

If you want to do it on a list or NumPy array, you can use scikit-learn:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

le.fit_transform(y)
array([2, 0, 0, 1, 2, 0, 1], dtype=int64)
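
LabelEncoder sorts the distinct labels and numbers them in that order. Assuming NumPy is available, np.unique with return_inverse=True gives the same sorted classes and 0-based indices without scikit-learn. This is a minimal sketch for the example labels above:

```python
import numpy as np

y = ['class3', 'class1', 'class1', 'class2', 'class3', 'class1', 'class2']

# classes: sorted unique labels; indices: position of each label within classes.
classes, indices = np.unique(y, return_inverse=True)

print(classes.tolist())  # ['class1', 'class2', 'class3']
print(indices.tolist())  # [2, 0, 0, 1, 2, 0, 1]
```

The classes array can be kept around to map predicted indices back to label names, e.g. classes[indices].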

As noted above, your Tokenizer-based implementation starts indexing at 1:

[[2], [1], [1], [3], [2], [1], [3]]

This breaks Keras when it computes the loss and metrics. The loss will come out as nan because you'll have three output neurons (indices 0 to 2), but targets running from 1 to 3, i.e. from the 2nd position to a nonexistent 4th. tl;dr: don't use indexing that starts at 1 with Keras.
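
To make the mismatch concrete, here is a small plain-Python sketch that checks the Tokenizer output shown above against the valid target range for three output neurons, then flattens and shifts it to 0-based indices. The variable names are illustrative only. The shifted labels follow the Tokenizer's own ordering, so they need not match the alphabetical mapping that LabelEncoder produces.

```python
# Tokenizer output for the example labels: nested lists, 1-based indices.
tokenizer_output = [[2], [1], [1], [3], [2], [1], [3]]
num_classes = 3

# Valid sparse targets for num_classes output neurons are 0 .. num_classes - 1.
valid = range(num_classes)
out_of_range = [seq[0] for seq in tokenizer_output if seq[0] not in valid]
print(out_of_range)  # [3, 3]

# Flatten the nested lists and shift every index down by one.
zero_based = [seq[0] - 1 for seq in tokenizer_output]
print(zero_based)  # [1, 0, 0, 2, 1, 0, 2]
```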

Upvotes: 1
