cs95

Reputation: 402942

ML for numerical transformations

I have a small dataset of 200 rows. The dataset is very simple: each row consists of a real number in the range [0, 1] that maps to a single label. There are 24 labels in total, and my task is essentially to train a classifier that finds the range of values that maps to each label.

There are two approaches I can think of. The first is an SVC, because of its ability to separate the input space into 24 regions, which is what I need. However, when I tried coding it, the results were terrible: the classifier did not learn anything and output the same label regardless of the input value.

The second approach I am considering is a neural network, but given the lack of features and training data, I highly doubt the feasibility of this approach.

If requested, I can share my SVC code that I developed with scikit-learn.

Here's a look at my data that I've dumped onto the terminal:

Label: Min, Mean, Max
{0: [0.96, 0.98, 1.0],
 1: [0.15, 0.36, 0.92],
 2: [0.14, 0.56, 0.98],
 3: [0.37, 0.7, 1.0],
 4: [0.23, 0.23, 0.23],
 6: [0.41, 0.63, 0.97],
 7: [0.13, 0.38, 0.61],
 8: [0.11, 0.68, 1.0],
 9: [0.09, 0.51, 1.0],
 10: [0.19, 0.61, 0.97],
 11: [0.26, 0.41, 0.57],
 12: [0.29, 0.72, 0.95],
 13: [0.63, 0.9, 0.99],
 14: [0.06, 0.55, 1.0],
 15: [0.1, 0.64, 1.0],
 16: [0.26, 0.58, 0.95],
 17: [0.29, 0.88, 1.0],
 21: [0.58, 0.79, 1.0],
 22: [0.24, 0.59, 0.94],
 23: [0.12, 0.62, 0.95]}

As you can see, the data is all over the place, but I want to find out whether it is possible to find a range that best represents each label.

I'd appreciate it if someone could tell me whether I'm on the right track. Thanks!

Upvotes: 0

Views: 85

Answers (2)

sascha

Reputation: 33532

If we assume that the samples of each class are roughly centered around one value (but still noisy, so classes can overlap), probably the most natural classifier available in sklearn is Gaussian Naive Bayes, which assumes that the points of each class follow a normal distribution.

Here is some code that builds fake data, classifies it, and evaluates the result:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
np.random.seed(1)


# Data parameters and data generation
N_CLASSES = 24
N_SAMPLES_PER_CLASS = 10
SIGMA = 0.01

# One random center per class in [0, 1)
class_centers = np.random.random(size=N_CLASSES)

# Gaussian noise around each center: one row per class, flattened class by class
noise = np.random.normal(size=(N_CLASSES, N_SAMPLES_PER_CLASS)) * SIGMA
X = (class_centers[:, np.newaxis] + noise).reshape(-1, 1)

# Label i repeated once for each sample of class i
Y = np.repeat(np.arange(N_CLASSES), N_SAMPLES_PER_CLASS)

# Split, fit and evaluate (10% of the samples are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)

print('Prediction on test')
preds = clf.predict(X_test)
print(preds)

print('Original samples')
print(y_test)

print('Accuracy-score')
print(accuracy_score(y_test, preds))

Output

Prediction on test
[10  7  3  7  8  3 23  3 11 19  7 20  8 15 11 13 18 11  3 16  8  9  8 12]
Original samples
[10  7  3  7 10 22 15 22 15 19  7 20  8 15 23 13 18 11 22  0 10 17  8 12]
Accuracy-score
0.583333333333

Of course the result is highly dependent on N_SAMPLES_PER_CLASS and SIGMA.
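To make that dependence concrete, here is a minimal, dependency-free sketch that sweeps the noise level. It is an assumption-laden stand-in, not the sklearn code above: `make_data`, `fit_gaussian_nb`, `predict` and `accuracy_for` are hypothetical helpers written in plain Python, and the small constant added to the variance is an arbitrary choice.

```python
import math
import random
from collections import defaultdict


def make_data(n_classes, n_per_class, sigma, rng):
    """Draw n_per_class noisy points around one random center per class."""
    centers = [rng.random() for _ in range(n_classes)]
    return [(rng.gauss(center, sigma), label)
            for label, center in enumerate(centers)
            for _ in range(n_per_class)]


def fit_gaussian_nb(samples):
    """Estimate (prior, mean, variance) per class from (x, label) pairs."""
    by_label = defaultdict(list)
    for x, label in samples:
        by_label[label].append(x)
    model = {}
    for label, xs in by_label.items():
        mean = sum(xs) / len(xs)
        # Small constant keeps the variance strictly positive (assumed value).
        var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-9
        model[label] = (len(xs) / len(samples), mean, var)
    return model


def predict(model, x):
    """Return the label maximising log prior + Gaussian log-likelihood."""
    def score(params):
        prior, mean, var = params
        return (math.log(prior)
                - 0.5 * math.log(2 * math.pi * var)
                - (x - mean) ** 2 / (2 * var))
    return max(model, key=lambda label: score(model[label]))


def accuracy_for(sigma, n_classes=24, n_per_class=10, seed=1):
    """Generate data, hold out 10% for testing, and return test accuracy."""
    rng = random.Random(seed)
    data = make_data(n_classes, n_per_class, sigma, rng)
    rng.shuffle(data)
    n_test = len(data) // 10
    test, train = data[:n_test], data[n_test:]
    model = fit_gaussian_nb(train)
    hits = sum(predict(model, x) == label for x, label in test)
    return hits / n_test


for sigma in (0.001, 0.01, 0.1, 0.3):
    print(sigma, accuracy_for(sigma))
```

The same seed gives the same class centers for every noise level, so the only thing that changes between runs is how far the samples spread around them.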

EDIT:

As you have now posted your data, it's obvious that my assumption does not hold. See the following plot, produced by this code (the brackets and parentheses `[]()` were stripped from the file first; people should really post CSV-compatible data!):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('idVXjwgZ.txt', usecols=[0,1], names=['x', 'y'])
sns.swarmplot(data=data, x='y', y='x')
plt.show()

Plot: [image not available: swarm plot of x for each label y]

Now imagine observing some x and having to decide on y. For most x-ranges that is pretty hard, because many labels cover the same values.

There is obviously also a class-imbalance problem, which explains why class 14 is output for most predictions.
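One quick way to see how much imbalance alone can explain is a majority-class baseline: always predict the most frequent label and measure the accuracy. The sketch below uses a made-up label list purely for illustration (it is not the asker's data); `labels` and its counts are assumptions.

```python
from collections import Counter

# Hypothetical labels for illustration only; not the asker's data.
labels = [14] * 60 + [9] * 25 + [2] * 10 + [0] * 4 + [4] * 1

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a "classifier" that always answers with the majority label.
baseline_accuracy = majority_count / len(labels)

print(counts.most_common())
print(majority_label, baseline_accuracy)
```

If a trained classifier barely beats this number, it has learned little beyond the class frequencies.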

Upvotes: 2

Prune

Reputation: 77880

If the label ranges don't overlap, then this is not an ML problem; it's a simple list-sorting task. Sort the data on the real number; group by label. Within each label, take the min and max values; that's your range.

If you need partitions instead, then sort the ranges in order of their real values. For each pair of adjacent classes, take the midpoint of the two boundary values (the max of one class and the min of the next) and make that the partition between the classes.

For instance, given the list of 12 values in 3 classes

(0.10, 3), (0.40, 2), (0.11, 3), (0.24, 1),
(0.20, 1), (0.21, 1), (0.12, 3), (0.41, 2),
(0.18, 3), (0.42, 2), (0.46, 2), (0.22, 1)

Sort the list by the first value in each pair:

(0.10, 3), (0.11, 3), (0.12, 3), (0.18, 3),
(0.20, 1), (0.21, 1), (0.22, 1), (0.24, 1),
(0.40, 2), (0.41, 2), (0.42, 2), (0.46, 2)

You now have a range for each label:

3 [0.10 - 0.18]
1 [0.20 - 0.24]
2 [0.40 - 0.46]

If you want partition values, just take the boundary averages: (0.18 + 0.20) / 2 = 0.19 and (0.24 + 0.40) / 2 = 0.32 are the values that separate your classes.
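The steps above can be sketched in a few lines of Python. This is one possible reading of the procedure, not a definitive implementation: the helper names `label_ranges`, `ranges_overlap` and `partitions` are hypothetical, and rounding the midpoints to two decimals is an assumption made to match the precision of the example.

```python
from collections import defaultdict

# The 12 (value, label) pairs from the example above.
data = [
    (0.10, 3), (0.40, 2), (0.11, 3), (0.24, 1),
    (0.20, 1), (0.21, 1), (0.12, 3), (0.41, 2),
    (0.18, 3), (0.42, 2), (0.46, 2), (0.22, 1),
]


def label_ranges(pairs):
    """Return [(label, lo, hi), ...] sorted by the lower bound of each range."""
    values = defaultdict(list)
    for value, label in pairs:
        values[label].append(value)
    ranges = [(label, min(v), max(v)) for label, v in values.items()]
    return sorted(ranges, key=lambda r: r[1])


def ranges_overlap(ranges):
    """True if any range starts at or before the end of the previous one."""
    return any(nxt[1] <= cur[2] for cur, nxt in zip(ranges, ranges[1:]))


def partitions(ranges):
    """Midpoint between each range's max and the next range's min."""
    # Rounded to two decimals (assumed) to match the example's precision.
    return [round((cur[2] + nxt[1]) / 2, 2)
            for cur, nxt in zip(ranges, ranges[1:])]


ranges = label_ranges(data)
print(ranges)                  # [(3, 0.1, 0.18), (1, 0.2, 0.24), (2, 0.4, 0.46)]
print(ranges_overlap(ranges))  # False
print(partitions(ranges))      # [0.19, 0.32]
```

Running the overlap check first tells you whether the approach applies at all. In the summary posted in the question the ranges overlap heavily (label 1 spans 0.15 to 0.92 and label 2 spans 0.14 to 0.98, for example), so the check would report an overlap there.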

Upvotes: 1
