veksev

Reputation: 90

How to select and use features of varying datatypes?

I'm a complete newbie to machine learning, and while I have some scikit-learn classifiers "working" on my dataset, I'm not sure if I'm using them correctly. I'm doing supervised learning with a hand-labeled training set.

The problem is: each item in my dataset is a dictionary with approx. 80 keys whose values are either text, booleans, or integers, and I want to use them as features. I have about 40,000 items and have hand-labeled about 800 of them. Am I meant to select, for example, only boolean features to use, or only integers? Do I need to normalize the features (remove the mean and scale to unit variance)? I'm not going to attempt analysis of the text yet, so it may be worth not even giving those features to the classifier. Would it be dumb to just try various combinations of features of the same type (ints)? It could also be that I'm approaching my dataset completely wrong... it's shaped like this:

[ [a, b, c, ...], [a, b, c, ...], [a, b, c, ...], ...]

Essentially, what I hope to achieve is a binary classification of each item in the dataset, basically just "Good" or "Bad" according to what I've hand-labeled. I read that some classifiers work better on particular data types (Bernoulli Naive Bayes on boolean features, for example), and that K Nearest Neighbors works when the "decision boundary is very irregular".

Ultimately, I want a comparison of classifier accuracy across several different algorithms, and hopefully to isolate one that is actually accurate at classifying my data...

Upvotes: 1

Views: 91

Answers (1)

Andreas Mueller

Reputation: 28748

All classifiers in scikit-learn require numeric data. Boolean features are fine as they are; for integer features, it depends on whether they encode categorical, ordinal, or numeric data.

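The preprocessing you need to do depends on the type of each feature, not on whether you want to combine them. For example, integers that are really category codes are usually one-hot encoded, while truly numeric values can be scaled for classifiers that are sensitive to feature scale. Combining the different feature types is probably a good idea.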
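To make the per-type preprocessing concrete, here is a minimal sketch using ColumnTransformer. The toy data and the column meanings (a boolean flag, a numeric count, and an integer category code) are assumptions made up for illustration, not taken from the asker's dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, one row per item:
# column 0 = boolean flag (0/1), column 1 = numeric count,
# column 2 = integer code for a category (0, 1 or 2)
X = np.array([
    [1.0, 10.0, 0.0],
    [0.0, 250.0, 2.0],
    [1.0, 35.0, 1.0],
    [0.0, 7.0, 2.0],
])

preprocess = ColumnTransformer(
    transformers=[
        ("bool", "passthrough", [0]),      # booleans are used as they are
        ("num", StandardScaler(), [1]),    # zero mean, unit variance
        ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),  # one column per category
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Output columns follow the transformer order: 1 boolean + 1 scaled + 3 one-hot
X_prepared = preprocess.fit_transform(X)
print(X_prepared.shape)
```

The result can be passed to any classifier, or the transformer can be placed in front of a classifier in a Pipeline so that the same preprocessing is applied at prediction time.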

You can do a simple transformation of the text data using CountVectorizer or TfidfVectorizer.
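As a rough sketch of what those two vectorizers produce, assuming a few made-up text values (the strings below are placeholders, not real data from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical text values, one string per item
docs = [
    "good product fast shipping",
    "bad product slow shipping",
    "good good value",
]

# Bag of words: one column per distinct word, values are raw counts
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Same vocabulary, but counts are reweighted by inverse document frequency
# and each row is L2-normalized by default
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

print(sorted(count_vec.vocabulary_))  # the learned words
print(counts.toarray())
```

Both return sparse matrices with one row per item, so they can be stacked next to the other numeric features, for instance by using the vectorizer as one of the transformers in a ColumnTransformer.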

Upvotes: 3
