Marco

Reputation: 707

Tensorflow - use string labels to train neural network

For a university project I have to implement a neural network for an OCR task using TensorFlow. The training dataset consists of two files, train-data.csv and train-target.csv. In train-data.csv every row contains the bits of a 16x8 bitmap; in train-target.csv every row is a character [a-z], which is the label for the corresponding row in train-data.csv.

I'm having some issues with the label dataset. I followed the tutorial with the MNIST dataset, but the difference here is that I have string labels instead of one-hot encoded vectors. Following the tutorial, I'm trying to use the softmax function and the cross-entropy.

# First, y * tf.math.log(y_hat) computes the element-wise product of the two tensors
# Second, tf.reduce_sum(..., reduction_indices=[1]) computes the sum along the second dimension (the first one indexes the examples)
# Finally, tf.reduce_mean() computes the mean over the first dimension, i.e. the examples
cross_entropy = tf.reduce_mean(-tf.reduce_sum(tf.strings.to_number(y) * tf.math.log(y_hat), reduction_indices=[1]))

train_step = tf.compat.v1.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

In the lines above I've used tf.strings.to_number(y) to convert the char to a numeric value.

This conversion causes problems when I run the session, because the run() method does not accept tensor objects as values in feed_dict.

for _ in range(1000):
    batch_xs, batch_ys = next_batch(100, raw_train_data, train_targets)
    sess.run(train_step, feed_dict={x: batch_xs, y: tf.strings.to_number(batch_ys.reshape((100,1)))})

If I don't convert the char to a numeric value, I get this error:

InvalidArgumentError: StringToNumberOp could not correctly convert string: e
 [[{{node StringToNumber}}]]

I'm trying to figure out how to solve this issue, or more generally how to train a neural network using character labels. I've been working on this problem all day. Does anyone know how to solve it?

Upvotes: 1

Views: 2421

Answers (1)

Marco

Reputation: 707

I finally found the error. Because I'm quite new to machine learning, I had forgotten that many algorithms cannot handle categorical data directly.
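To see why the labels have to be numeric, it may help to look at what the cross-entropy from the question actually computes. The following is a minimal sketch in plain Python, assuming a made-up 3-letter alphabet and made-up softmax outputs (the function name and the numbers are illustrative only). With a label vector that is 1 for the true class and 0 elsewhere, the sum reduces to the negative log of the probability assigned to the true class, while a raw character cannot take part in the multiplication at all.

```python
import math


def cross_entropy(y, y_hat):
    """-sum(y * log(y_hat)) for a single example, as in the question's loss."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat))


# Hypothetical softmax output over the classes 'a', 'b', 'c'
y_hat = [0.7, 0.2, 0.1]

# Numeric label vector for 'b': 1 at the true class, 0 elsewhere
y = [0, 1, 0]

loss = cross_entropy(y, y_hat)
# Only the true-class term survives the sum
print(math.isclose(loss, -math.log(0.2)))

# A raw character label cannot be multiplied by a float
try:
    'b' * math.log(0.2)
    char_label_works = True
except TypeError:
    char_label_works = False
print(char_label_works)
```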

The solution was to one-hot encode the target labels and feed the resulting array to the network, using this function:

# define universe of possible input values
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))


def one_hot_encode(data_array):
    integer_encoded = [char_to_int[char] for char in data_array]

    # one hot encode
    onehot_encoded = list()
    for value in integer_encoded:
        letter = [0 for _ in range(len(alphabet))]
        letter[value] = 1
        onehot_encoded.append(letter)

    return onehot_encoded
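For larger batches, the same encoding can probably be done without the inner Python loop. The following is a sketch under the assumption that NumPy is available and that the labels arrive as an array of single lowercase characters; `one_hot_encode_np` and `one_hot_decode` are hypothetical helper names, not part of the code above.

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz'
char_to_int = {c: i for i, c in enumerate(alphabet)}


def one_hot_encode_np(labels):
    """Return a (len(labels), 26) float32 array with a single 1 per row."""
    indices = np.array([char_to_int[c] for c in labels])
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(len(alphabet), dtype=np.float32)[indices]


def one_hot_decode(encoded):
    """Map each row back to its character via the position of its maximum."""
    return [alphabet[i] for i in np.argmax(encoded, axis=1)]


batch_ys = np.array(['e', 'a', 'z'])  # hypothetical batch of labels
encoded = one_hot_encode_np(batch_ys)
print(encoded.shape)            # (3, 26)
print(one_hot_decode(encoded))  # ['e', 'a', 'z']
```

The encoded array could then be passed as the value for y in feed_dict, assuming y is a float placeholder of shape [None, 26] and the tf.strings.to_number call is removed from the loss. The decode helper works the same way on the network's softmax output, which is how predicted rows can be turned back into characters.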

Upvotes: 0
