ML for numerical transformations

Question

I have a small dataset of size 200. The dataset is very simple: each row consists of a real-valued number in the range [0, 1] that maps to a single label. There are 24 labels in total, and my task is essentially to train a classifier that finds the range of values mapping to each label.

There are 2 approaches I can think of. The first one would be SVCs because of their ability to separate the input plane into 24 regions, which is what I need. However, when I tried coding it, I ended up with some terrible results: the classifier did not learn anything, and would spit out the same label regardless of the input value.

The second approach I am considering is a neural network, but given the lack of features and training data, I highly doubt the feasibility of this approach.

If requested, I can share my SVC code that I developed with scikit-learn.

Here's a look at my data that I've dumped onto the terminal:

Label: Min, Mean, Max
{0: [0.96, 0.98, 1.0],
 1: [0.15, 0.36, 0.92],
 2: [0.14, 0.56, 0.98],
 3: [0.37, 0.7, 1.0],
 4: [0.23, 0.23, 0.23],
 6: [0.41, 0.63, 0.97],
 7: [0.13, 0.38, 0.61],
 8: [0.11, 0.68, 1.0],
 9: [0.09, 0.51, 1.0],
 10: [0.19, 0.61, 0.97],
 11: [0.26, 0.41, 0.57],
 12: [0.29, 0.72, 0.95],
 13: [0.63, 0.9, 0.99],
 14: [0.06, 0.55, 1.0],
 15: [0.1, 0.64, 1.0],
 16: [0.26, 0.58, 0.95],
 17: [0.29, 0.88, 1.0],
 21: [0.58, 0.79, 1.0],
 22: [0.24, 0.59, 0.94],
 23: [0.12, 0.62, 0.95]}

As you can see, the data is all over the place, but I want to find out whether it is possible to find a range that each label best represents.

I'd appreciate it if someone could tell me whether I'm on the right track. Thanks!

sascha · Accepted Answer

If we assume that the samples of each class are somewhat centered (still noisy, and possibly overlapping), probably the most natural classifier available in sklearn is Gaussian Naive Bayes, which assumes that the points of each class follow a normal distribution.

Here is some code that builds fake data, classifies it, and evaluates the result:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
np.random.seed(1)


# Data parameters and data generation
N_CLASSES = 24
N_SAMPLES_PER_CLASS = 10
SIGMA = 0.01

class_centers = np.random.random(size=N_CLASSES)
# One block of noisy samples around each class center, drawn class by class
X = np.concatenate([
    class_center + np.random.normal(size=N_SAMPLES_PER_CLASS) * SIGMA
    for class_center in class_centers
]).reshape(-1, 1)
# Labels: each class index repeated once per sample, in the same order as X
Y = np.repeat(np.arange(N_CLASSES), N_SAMPLES_PER_CLASS)

# Split, fit, and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=0)

et = GaussianNB()
et.fit(X_train, y_train)

print('Prediction on test')
preds = et.predict(X_test)
print(preds)

print('Original samples')
print(y_test)

print('Accuracy-score')
print(accuracy_score(y_test, preds))

Output

Prediction on test
[10  7  3  7  8  3 23  3 11 19  7 20  8 15 11 13 18 11  3 16  8  9  8 12]
Original samples
[10  7  3  7 10 22 15 22 15 19  7 20  8 15 23 13 18 11 22  0 10 17  8 12]
Accuracy-score
0.583333333333

Of course, the result depends heavily on N_SAMPLES_PER_CLASS and SIGMA: fewer samples per class or a larger SIGMA (more overlap between classes) lowers the accuracy.
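To make that dependence concrete, here is a minimal sketch that repeats the experiment for several noise levels. It is not the exact script above: the helper name accuracy_for_sigma, the use of np.random.default_rng, the 50/50 stratified split, and the chosen sigma values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

N_CLASSES = 24
N_SAMPLES_PER_CLASS = 10


def accuracy_for_sigma(sigma, seed=1):
    """Hypothetical helper: GaussianNB test accuracy on fake data with noise level sigma."""
    rng = np.random.default_rng(seed)
    class_centers = rng.random(size=N_CLASSES)
    # Each class: N_SAMPLES_PER_CLASS points scattered around its center
    noise = sigma * rng.normal(size=(N_CLASSES, N_SAMPLES_PER_CLASS))
    X = (class_centers[:, None] + noise).reshape(-1, 1)
    y = np.repeat(np.arange(N_CLASSES), N_SAMPLES_PER_CLASS)
    # Assumption: a stratified 50/50 split, so every class appears in both halves
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)
    model = GaussianNB().fit(X_train, y_train)
    return model.score(X_test, y_test)


for sigma in (0.001, 0.01, 0.05, 0.2):
    print(sigma, accuracy_for_sigma(sigma))
```

With tightly clustered classes the score should be high, and it should fall off as sigma grows and the classes smear into each other. The exact numbers depend on the seed.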

EDIT:

Now that you have posted your data, it's obvious that my assumption does not hold. See the following plot, produced by this code (I first stripped the [] and () characters from the file; people should really post csv-compatible data!):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('idVXjwgZ.txt', usecols=[0,1], names=['x', 'y'])
sns.swarmplot(data=data, x='y', y='x')
plt.show()

Plot: (image not included; a swarm plot of x for each label y)

Now imagine observing some x and having to decide on y. For most x-ranges that is pretty hard, because many labels cover the same values.
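A rough way to see this without the plot is to count how many labels have a [min, max] range that contains a given x. The sketch below uses only the min and max values from the table in the question; the helper name labels_covering is made up for illustration, and a min/max range says nothing about how densely a label populates it, so this is an upper bound on the ambiguity rather than a precise measure.

```python
# (min, max) of x per label, copied from the table posted in the question
ranges = {
    0: (0.96, 1.0), 1: (0.15, 0.92), 2: (0.14, 0.98), 3: (0.37, 1.0),
    4: (0.23, 0.23), 6: (0.41, 0.97), 7: (0.13, 0.61), 8: (0.11, 1.0),
    9: (0.09, 1.0), 10: (0.19, 0.97), 11: (0.26, 0.57), 12: (0.29, 0.95),
    13: (0.63, 0.99), 14: (0.06, 1.0), 15: (0.1, 1.0), 16: (0.26, 0.95),
    17: (0.29, 1.0), 21: (0.58, 1.0), 22: (0.24, 0.94), 23: (0.12, 0.95),
}


def labels_covering(x):
    """Hypothetical helper: labels whose observed [min, max] range contains x."""
    return [label for label, (lo, hi) in ranges.items() if lo <= x <= hi]


for x in (0.05, 0.25, 0.5, 0.75):
    print(x, len(labels_covering(x)), "of", len(ranges), "labels")
```

For x = 0.5, for example, 16 of the 20 listed labels have a range that includes it, so the value alone barely narrows down the label.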

There is obviously also a class-balance problem, which explains why class 14 is output for most predictions.
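To check how much of a classifier's accuracy comes from imbalance alone, one option is to compare it against a majority-class baseline. The sketch below uses made-up label counts (not your real data) in which class 14 dominates, and scikit-learn's DummyClassifier as the baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Made-up, imbalanced labels for illustration only: class 14 dominates
y = np.array([14] * 60 + [8] * 25 + [9] * 15)
X = rng.random(size=(len(y), 1))  # uninformative feature in [0, 1)

# Samples per class, and the share of the most frequent class
counts = np.bincount(y)
majority_label = int(counts.argmax())
majority_share = counts.max() / len(y)

# Baseline that ignores X and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)

print("majority label:", majority_label)
print("majority share:", majority_share)
print("baseline accuracy:", baseline_accuracy)
```

If a real model scores about the same as this baseline, it has probably learned little beyond the class frequencies. Possible countermeasures in scikit-learn include class_weight="balanced" for SVC or explicit priors for GaussianNB, although with ranges overlapping this much they may not help a lot.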
