Ishwar
Ishwar

Reputation: 84

How do I transform a column of values into possibly discrete training output classes?

My dataset is a set of features and one column which is essentially views, each element of this column being an arbitrary positive real number, like the following.

Now in Python, I want to run a Keras model on it. I plan to use this column as output labels for training the classifier. The only solution I can think of is to scale it using sklearn and then conditionally somehow sort the scaled values into classes for enough training labels. Like for example, if 0.1 < views < 0.2: set_scale_to_0.1 and so on. Is this the best way to go about it?

+-------+

| Views |

+-------+

| 173   |

+-------+

| 943   |

+-------+

Upvotes: 0

Views: 309

Answers (2)

Darius
Darius

Reputation: 12072

Sorry, but what's the content of the Views column? Is it an annotation (classification) or a counter of something (regression)?

If it's a column which contains labels, you can convert each single value to a one hot encoder (see the example below). That final representation can be used as output layer to fit a neural network.

import numpy as np
from keras.utils import to_categorical

views = np.random.randint(3, size=10)
print(views)
# [0 1 2 1 0 1 1 2 1 0]

num_classes = len(set(views))
print(num_classes)
# 3

views = to_categorical(views, num_classes)
print(views)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]]

Check https://keras.io/#getting-started-30-seconds-to-keras for an example. The variable views is y_train.

Upvotes: 1

muskrat
muskrat

Reputation: 1559

I suggest not trying to force this into a classification problem, but instead treating it as a regression. Two reasons:

First: your model targets (“labels” in classification) are not discrete, but integer valued. This means that you will lose information with any effort to discretize them.

Second: classification is useful when labels being near each other does not contain information (like class 1 and 2 are no more similar than class 1 and 4). However, you want to gain information from data points that are near each other in terms of views.

So, you likely want to use regression. You can do this with Keras no problem; you just have to change the last layer (and possibly some other things, depending on your architecture). Try looking for “regression network” examples.

Upvotes: 1

Related Questions