Sreekar

Reputation: 1015

How to preprocess this floating point data to use with scikit - Machine Learning

I have a dataset with 4000 features and 35 samples. All the features are floating-point numbers between 1 and 3, e.g. 2.68244527684596.

I'm struggling to get any classifier working on this data. I have tried k-NN and SVM (with linear, rbf, and poly kernels). Then I learned about normalization. Still, it's a bit complex for me and I cannot get this code to work and give me proper predictions.

The code I'm using to normalize the data is:

from sklearn import preprocessing

train_data = preprocessing.scale(train_data)                          # zero mean, unit variance per feature
train_data = preprocessing.normalize(train_data, norm='l1', axis=0)  # L1-normalize each feature column

The code I'm trying to classify with is:

from sklearn import svm, linear_model
from sklearn.neighbors import KNeighborsClassifier

# SVM with poly kernel
svc1 = svm.SVC(kernel='poly', degree=3)
svc1.fit(train_data[:-5], train_labels[:-5])
print("Poly SVM: ", svc1.predict(train_data[-5:]))

# SVM with rbf kernel
svc2 = svm.SVC(kernel='rbf')
svc2.fit(train_data[:-5], train_labels[:-5])
print("RBF SVM: ", svc2.predict(train_data[-5:]))

# SVM with linear kernel
svc3 = svm.SVC(kernel='linear')
svc3.fit(train_data[:-5], train_labels[:-5])
print("Linear SVM: ", svc3.predict(train_data[-5:]))

# KNN
knn = KNeighborsClassifier()
knn.fit(train_data[:-5], train_labels[:-5])
print("KNN: ", knn.predict(train_data[-5:]))

# Logistic regression
logistic = linear_model.LogisticRegression()
logistic.fit(train_data[5:], train_labels[5:])
print('LogisticRegression score: %f' % logistic.score(train_data[:5], train_labels[:5]))

I'm a newbie to machine learning and I'm working hard to learn all the concepts. I hoped someone could point me in the right direction.

Note: I have only 35 samples and this is part of an assignment. I cannot get more data :(

Upvotes: 0

Views: 1477

Answers (1)

lejlot

Reputation: 66805

If your data is not special in any sense, then standardization via preprocessing.scale should be just fine. It forces each dimension to have zero mean and unit standard deviation, so, more or less, it tries to enclose the data in a 0-centered ball. Note that you should not also apply normalize: normalize forces each sample to have unit norm, which effectively places your points on a sphere, and that has to be justified by your data. It rarely is.
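For illustration, here is a minimal sketch of that effect on made-up data in the same 1..3 range (the array X below is an assumption, not your actual data):

import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)
X = rng.uniform(1, 3, size=(35, 4000))  # hypothetical: 35 samples, 4000 features

X_std = preprocessing.scale(X)
print(X_std.mean(axis=0)[:3])  # each feature now has mean ~0
print(X_std.std(axis=0)[:3])   # ... and standard deviation ~1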

There might be dozens of reasons why your classifiers do not work. In particular: is the code above your actual testing procedure? If so:

  • you should not test on just 5 samples; learn about cross-validation (available in scikit-learn) and run at least 3-fold CV (see the first sketch after this list)
  • learn about and tune the various hyperparameters. An SVM has several of them (depending on the kernel used, usually from 1 to 4: for the RBF kernel it is C and gamma; for poly it is C, degree, coef0, ...); KNN has around 3 (k, the metric, the weights); logistic regression at least 1 (the regularization strength)
  • before building classifiers, look at your data. Plot it on a plane (PCA), try plotting projections onto each feature. What are the characteristics of your data? Is it balanced? (see the second sketch after this list)
  • most importantly, gather more data! You have 35 points in a 4000-dimensional space... that is a ridiculously small number of samples to do anything with... if you are not able to get at least 10x (preferably 100x) more points, start by reducing the dimensionality of your data to at most 30 dimensions using dimensionality reduction (even scikit-learn's PCA)
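A rough sketch of the first two points, assuming train_data and train_labels as in the question (the grid values are just plausible starting points, not tuned for your data):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10, 100],        # regularization strength
    'gamma': [1e-3, 1e-2, 1e-1, 1],  # RBF kernel width
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3)  # 3-fold CV on every grid point
search.fit(train_data, train_labels)
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)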
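And a sketch of the last two points: projecting to 2 dimensions with PCA to look at the data, then reducing to at most 30 dimensions before classifying (again assuming train_data and train_labels, with numeric labels for the colors):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Look at the data on a plane
X_2d = PCA(n_components=2).fit_transform(train_data)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=train_labels)
plt.title('PCA projection of the 35 samples')
plt.show()

# Reduce 4000 dimensions to 30 before feeding any classifier
X_reduced = PCA(n_components=30).fit_transform(train_data)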

Upvotes: 2
