Reputation: 1815
I have class labels as:
y = ["class1", "class2", "class3"]
To use them in a model, I want to convert these classes to numeric indices with methods from Keras and/or TensorFlow 2.0.
What I am doing currently is:
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(y)
y_train = tokenizer.texts_to_sequences(y)
I know that the tokenizer is kind of misused here. Is there a better, more concise way to convert class labels to indices? Thanks.
Upvotes: 2
Views: 1209
Reputation: 36604
You can't use a Tokenizer for this, because the Tokenizer's indexing starts at 1, not 0. You can use tf.where:
import tensorflow as tf
y = ['class3', 'class1', 'class1', 'class2', 'class3', 'class1', 'class2']
names = ["class1", "class2", "class3"]
labeler = lambda x: tf.where(tf.equal(x, names))
dataset = tf.data.Dataset.from_tensor_slices(y).map(labeler)
next(iter(dataset))
<tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[2]], dtype=int64)>
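Note that each mapped label comes out with shape (1, 1). If your loss expects scalar integer targets, you can squeeze it inside the same map call; continuing the snippet above:

# squeeze the (1, 1) index tensor down to a scalar label
dataset = tf.data.Dataset.from_tensor_slices(y).map(
    lambda x: tf.squeeze(tf.where(tf.equal(x, names))))

print([int(label) for label in dataset])  # [2, 0, 0, 1, 2, 0, 1]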
If you want to do it on a list or Numpy array you can use Scikit-Learn:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(y)
array([2, 0, 0, 1, 2, 0, 1], dtype=int64)
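If you'd rather stay entirely within Keras, newer TensorFlow releases (2.4+, or tf.keras.layers.experimental.preprocessing in earlier 2.x) ship a StringLookup layer that does this 0-based mapping directly. A minimal sketch, assuming that layer is available in your install:

import tensorflow as tf

names = ["class1", "class2", "class3"]
y = ['class3', 'class1', 'class1', 'class2', 'class3', 'class1', 'class2']

# num_oov_indices=0 keeps the mapping 0-based (no slot reserved for unknown labels)
lookup = tf.keras.layers.StringLookup(vocabulary=names, num_oov_indices=0)
print(lookup(tf.constant(y)).numpy())  # [2 0 0 1 2 0 1]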
As I said previously, your implementation started indexing at 1:
[[2], [1], [1], [3], [2], [1], [3]]
This crashes Keras when it measures loss and metrics: it will return nan because you'll have three final neurons but targets starting at the 2nd position and running to the 4th. tl;dr: don't use indexing that starts at 1 with Keras.
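To make the failure mode concrete, here's a small sketch of my own with a 3-neuron softmax output: a 0-based target works, while a 1-based target that reaches 3 falls outside the valid range [0, 3) and produces an error or NaN depending on the device:

import tensorflow as tf

probs = tf.constant([[0.2, 0.3, 0.5]])           # output of a 3-neuron softmax head
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

print(loss_fn(tf.constant([2]), probs).numpy())  # 0-based label: fine, -log(0.5) ~= 0.693
try:
    # 1-based label "3" is outside [0, 3): error on CPU, NaN loss on GPU
    print(loss_fn(tf.constant([3]), probs).numpy())
except tf.errors.InvalidArgumentError as err:
    print("invalid label:", type(err).__name__)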
Upvotes: 1