Marco Favorito

Reputation: 410

KNeighborsClassifier with non-numeric data fails

I'm trying to train a KNeighborsClassifier on non-numeric data, supplying a custom metric that computes a similarity score between samples.

from sklearn.neighbors import KNeighborsClassifier

#Compute the "ASCII" distance:   
def my_metric(a,b):
    return ord(a)-ord(b)

#Samples and labels
X = [["a"],["b"], ["c"],["m"], ["z"]]

#S=Start of the alphabet, M=Middle, E=end
y = ["S", "S", "S", "M", "E"]

model = KNeighborsClassifier(metric=my_metric)
model.fit(X,y)

X_test = [["e"],["f"],["w"]]
y_test = [["S"],["M"],["E"]]
model.score(X_test, y_test)

I get the following error:

Traceback (most recent call last):
File "/home/marcofavorito/virtualenvs/nlp/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
  exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-20-e339c96eea22>", line 1, in <module>
  model.score(X_test, y_test)
File "/home/marcofavorito/virtualenvs/nlp/lib/python3.5/site-packages/sklearn/base.py", line 350, in score
  return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/home/marcofavorito/virtualenvs/nlp/lib/python3.5/site-packages/sklearn/neighbors/classification.py", line 145, in predict
  neigh_dist, neigh_ind = self.kneighbors(X)
File "/home/marcofavorito/virtualenvs/nlp/lib/python3.5/site-packages/sklearn/neighbors/base.py", line 361, in kneighbors
  **self.effective_metric_params_)
File "/home/marcofavorito/virtualenvs/nlp/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 1247, in pairwise_distances
  return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
File "/home/marcofavorito/virtualenvs/nlp/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 1090, in _parallel_pairwise
  return func(X, Y, **kwds)
File "/home/marcofavorito/virtualenvs/nlp/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 1104, in _pairwise_callable
  X, Y = check_pairwise_arrays(X, Y)
File "/home/marcofavorito/virtualenvs/nlp/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 110, in check_pairwise_arrays
  warn_on_dtype=warn_on_dtype, estimator=estimator)
File "/home/marcofavorito/virtualenvs/nlp/lib/python3.5/site-packages/sklearn/utils/validation.py", line 402, in check_array
  array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'e'

I guess I could implement the algorithm myself very easily, but then I would lose all the features of the sklearn classifier. Am I missing some option? Or is it simply impossible to train the model without first translating the samples into floats?

N.B. I know that the problem can easily be solved by using numbers instead of characters. But I need to solve another problem that deals with non-numeric data, where I cannot find a simple mapping to floats.

Upvotes: 2

Views: 1941

Answers (2)

sascha

Reputation: 33532

Apart from the stuff which Mohammed already mentioned: your approach is mathematically flawed and sklearn probably gives no guarantees what will happen.

The KNN classifier is just a nice wrapper around core data structures such as KD-trees and Ball-trees. Here you can see what kind of assumptions those need.

Here func is a function which takes two one-dimensional numpy arrays, and returns a distance. Note that in order to be used within the BallTree, the distance must be a true metric: i.e. it must satisfy the following properties

Non-negativity: d(x, y) >= 0

Identity: d(x, y) = 0 if and only if x == y

Symmetry: d(x, y) = d(y, x)

Triangle Inequality: d(x, y) + d(y, z) >= d(x, z)

To be fair, that's just what a metric is.

Stated like that, your metric is not a real metric! Even the most obvious rule, non-negativity, does not hold: my_metric("a", "b") returns -1.
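To make this concrete, here is a minimal sketch that checks the four properties by brute force on the question's sample characters. The helper `metric_violations` and the alternative `abs_metric` are hypothetical names introduced only for illustration; they are not part of sklearn.

```python
from itertools import product

def my_metric(a, b):
    # The "ASCII" distance from the question: signed, so it can be negative.
    return ord(a) - ord(b)

def abs_metric(a, b):
    # Hypothetical alternative: absolute difference of the code points.
    return abs(ord(a) - ord(b))

def metric_violations(dist, points):
    """Return the names of the metric properties that dist violates on points."""
    violated = set()
    for x, y in product(points, repeat=2):
        if dist(x, y) < 0:
            violated.add("non-negativity")
        if (dist(x, y) == 0) != (x == y):
            violated.add("identity")
        if dist(x, y) != dist(y, x):
            violated.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if dist(x, y) + dist(y, z) < dist(x, z):
            violated.add("triangle inequality")
    return sorted(violated)

points = ["a", "b", "c", "m", "z"]
print(metric_violations(my_metric, points))   # properties the original breaks
print(metric_violations(abs_metric, points))  # empty list if all checks pass
```

A check like this only covers the points you feed it, so it can expose a broken metric but cannot prove that a function is a metric in general.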

Now, the text above only gives a warning regarding Ball-tree (not KD-tree), and KNN chooses the underlying tree structure automatically. So there might be a bad case here, which you should avoid.

I'm not sure, though, whether these assumptions are needed for KD-tree too! I would expect so, and would just point to the KD-tree docs, which still use the word metric, and to the available kd_tree.valid_metrics (although this list is just a subset of the common metrics that come with sklearn).

Upvotes: 2

Gambit1614

Reputation: 8801

There are a few errors in your code. First, you have to somehow convert the categorical data into numerical data; the KNN classifier in sklearn does not support categorical data yet. Second, you need to use the make_scorer() function in sklearn in order to use your custom metrics. The default score function of a KNN classifier returns the mean accuracy, not the metric you specified. You can read more about it here. You need to change your dataset in order to use this sklearn implementation of the KNN classifier.
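If no sensible mapping of the raw samples to floats exists, one possible workaround is to compute the pairwise distances yourself and pass them with metric="precomputed", so that sklearn only ever sees a numeric distance matrix. The sketch below assumes an absolute code-point distance; `char_distance` and `distance_matrix` are hypothetical helpers, and n_neighbors=1 is chosen only to keep the toy example unambiguous.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def char_distance(a, b):
    # Assumed distance for illustration: absolute difference of code points.
    return abs(ord(a) - ord(b))

def distance_matrix(rows, cols):
    # Entry [i, j] is the distance between rows[i] and cols[j].
    return np.array(
        [[char_distance(r, c) for c in cols] for r in rows], dtype=float
    )

X_train = ["a", "b", "c", "m", "z"]
y_train = ["S", "S", "S", "M", "E"]
X_test = ["e", "f", "w"]

model = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
# Fit on the square train-vs-train distance matrix.
model.fit(distance_matrix(X_train, X_train), y_train)
# Predict from the test-vs-train distance matrix (one row per test sample).
predictions = model.predict(distance_matrix(X_test, X_train))
print(predictions)
```

The cost is that the full train-vs-train matrix has to fit in memory, and every prediction needs the distances from the new sample to all training samples.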

Upvotes: 1
