Ofer Sadan

Reputation: 11932

Machine learning multi-classification: Why use 'one-hot' encoding instead of a number

I'm currently working on a classification problem with TensorFlow, and I'm new to the world of machine learning, but there's something I don't get.

I have successfully trained models that output the y tensor like this:

y = [0,0,1,0]
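For context, labels like that can be built from plain integer class IDs, e.g. with tf.one_hot (a minimal sketch; the values are just illustrative):

    import tensorflow as tf

    # Integer class IDs for a 4-class problem (illustrative values)
    labels = tf.constant([2, 0, 3])

    # Each ID becomes a vector with a single 1 at that index
    one_hot = tf.one_hot(labels, depth=4)
    # -> [[0,0,1,0], [1,0,0,0], [0,0,0,1]]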

But I can't understand the principle behind it...

Why not just train the same model to output classes such as y = 3 or y = 4?

This seems much more flexible, because I can imagine a multi-classification problem with 2 million possible classes, and it would be much more efficient to output a number between 0 and 2,000,000 than a tensor of 2,000,000 items for every result.

What am I missing?

Upvotes: 1

Views: 3960

Answers (3)

P-Gn

Reputation: 24591

Neural networks use gradient descent to optimize a loss function. In turn, this loss function needs to be differentiable.

A discrete output would be (indeed is) a perfectly valid and valuable output for a classification network. The problem is, we don't know how to optimize such a network efficiently.

Instead, we rely on a continuous loss function. This loss function is usually based on something that is more or less related to the probability of each label -- and for this, you need a network output that has one value per label.

Typically, the output that you describe is then deduced from this soft, continuous output by taking the argmax of these pseudo-probabilities.
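For illustration, here is a minimal sketch in plain NumPy (the numbers are made up): the network emits one continuous value per label, the loss is computed against the one-hot target, and the discrete class is only deduced afterwards with argmax:

    import numpy as np

    def softmax(logits):
        # Subtract the max for numerical stability
        z = logits - np.max(logits)
        e = np.exp(z)
        return e / e.sum()

    # Raw network outputs (logits), one value per label
    logits = np.array([1.2, 0.3, 3.1, -0.5])
    probs = softmax(logits)             # continuous pseudo-probabilities

    # One-hot target for class 2
    target = np.array([0.0, 0.0, 1.0, 0.0])

    # Cross-entropy loss: continuous in the logits, so gradients exist
    loss = -np.sum(target * np.log(probs))

    # The discrete prediction is deduced from the soft output
    predicted_class = np.argmax(probs)  # -> 2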

Upvotes: 1

nessuno

Reputation: 27052

Ideally, you could train your model to classify input instances and produce a single output. Something like

y=1 means input=dog, y=2 means input=airplane. An approach like that, however, brings a lot of problems:

  1. How do I interpret an output like y=1.5?
  2. Why am I regressing a number, as if I were working with continuous data, when I'm in reality working with discrete data?

In fact, what you are doing is treating a multi-class classification problem like a regression problem. This is conceptually wrong (unless you're doing binary classification, in which case a positive and a negative output are all you need).

To avoid these (and other) issues, we use a final layer of neurons and associate a high activation with the right class.

The one-hot encoding represents the fact that you want to force your network to have a single high-activation output when a certain input is present.

Thus, every input=dog will have 1, 0, 0 as output, and so on.

In this way, you're correctly treating a discrete classification problem and producing a discrete, easily interpretable output (in fact, you'll always extract the output neuron with the highest activation using tf.argmax; even if your network hasn't learned to produce a perfect one-hot encoding, you'll still be able to extract the most likely output without doubt).
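For example (a minimal sketch, with made-up activation values): even when the output is far from a perfect one-hot vector, tf.argmax still recovers the most likely class:

    import tensorflow as tf

    # Imperfect network output for one dog image (values are made up)
    activations = tf.constant([0.7, 0.2, 0.1])  # classes: dog, airplane, car

    # tf.argmax extracts the index of the highest activation
    predicted = tf.argmax(activations)          # -> 0, i.e. "dog"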

Upvotes: 4

Matthias Winkelmann

Reputation: 16394

The answer is in how that final tensor, or single value, is calculated. In an NN, your y=3 would be built by a weighted sum over the values of the previous layer.

Trying to train towards single values would then imply a linear relationship between the category IDs where none exists: for the true value y=4, the output y=3 would be considered better than y=1, even though the categories are random and might be 1: dogs, 3: cars, 4: cats.
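To make that concrete, a tiny sketch in plain NumPy (using the made-up category IDs above) showing how a squared-error loss on class IDs imposes an ordering the categories don't have:

    import numpy as np

    # Category IDs assigned arbitrarily: 1 = dogs, 3 = cars, 4 = cats
    true_id = 4  # the true class is "cats"

    # Squared error treats the IDs as points on a line
    for predicted_id in (3, 1):
        loss = (predicted_id - true_id) ** 2
        print(predicted_id, loss)
    # Prints: 3 1, then 1 9 -- "cars" looks closer to "cats" than
    # "dogs" does, even though the three categories are unrelated.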

Upvotes: 1
