user3579222

Reputation: 1430

Sklearn: Nearest Neighbour with String Values and Custom Metric

I have data that looks like the following (all values are strings):

>>> all_states[0:3]
[['A','B','Empty'],
 ['A', 'B', 'Empty'],
 ['C', 'D', 'Empty']]

I want to use a custom distance metric:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mydist(x, y):
    return 1
neigh = NearestNeighbors(n_neighbors=5, metric=mydist)

However, when I call

neigh.fit(np.array(all_states))

I get the error

ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

I know that I can use OneHotEncoder or LabelEncoder, but can I avoid encoding the data altogether, given that I have my own distance metric?

Upvotes: 3

Views: 678

Answers (3)

Kaiwen

Reputation: 367

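Also note that to use neigh.kneighbors with metric='precomputed' and custom query points, you have to pass it the distances between the query points and the fitted points, computed with the same metric: cdist(query_points, all_states, mydist) (cdist doc). Without the metric argument, cdist falls back to Euclidean distance, which does not work on string values. For example,

from scipy.spatial.distance import cdist

... # initialize and fit `neigh` as in @StupidWolf's answer
print(neigh.kneighbors(cdist(query_points, all_states, mydist)))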
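To make the shape requirement concrete, here is a minimal sketch that builds both matrices with a plain list comprehension instead of cdist, so it does not depend on how scipy treats string input. The metric mismatch_count, the helper distance_matrix and the query row are made up for illustration and are not part of the question.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def mismatch_count(x, y):
    # Hypothetical metric: number of positions where the two rows differ.
    return sum(a != b for a, b in zip(x, y))


def distance_matrix(rows_a, rows_b, metric):
    # Hypothetical helper; the result has shape (len(rows_a), len(rows_b)).
    return np.array([[metric(a, b) for b in rows_b] for a in rows_a], dtype=float)


all_states = [["A", "B", "Empty"], ["A", "B", "Empty"], ["C", "D", "Empty"]]
query_points = [["C", "D", "Full"]]  # made-up query row

neigh = NearestNeighbors(n_neighbors=2, metric="precomputed")
# fit() needs the square (n_indexed, n_indexed) matrix
neigh.fit(distance_matrix(all_states, all_states, mismatch_count))

# kneighbors() needs a (n_queries, n_indexed) matrix
query_dm = distance_matrix(query_points, all_states, mismatch_count)
distances, indices = neigh.kneighbors(query_dm)
print(distances, indices)
```

The nearest fitted row for this query is the third one (index 2), at distance 1.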

Upvotes: 0

StupidWolf

Reputation: 46968

On the help page,

metric : str or callable, default='minkowski'

The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of DistanceMetric for a list of available metrics. If metric is "precomputed", X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only "nonzero" elements may be considered neighbors.

You can compute the pairwise distances with pdist (documentation) and convert them with squareform into the square matrix that is required as input:

all_states = [['A','B','Empty'],
 ['A', 'B', 'Empty'],
 ['C', 'D', 'Empty']]

from scipy.spatial.distance import pdist,squareform
from sklearn.neighbors import NearestNeighbors

dm = squareform(pdist(all_states, mydist))
dm

array([[0., 1., 1.],
       [1., 0., 1.],
       [1., 1., 0.]])

# fit() accepts this, but kneighbors() needs n_neighbors <= number of fitted rows
neigh = NearestNeighbors(n_neighbors=5, metric="precomputed")
neigh.fit(dm)
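If pdist complains about the string rows in your environment, one workaround is to run pdist over the row indices and let the metric look up the actual rows. The following is only a sketch: mismatch_count and index_metric are hypothetical names, and mismatch_count stands in for whatever your real metric is. It also uses n_neighbors=2 so that the query works with only three rows.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import NearestNeighbors

all_states = [["A", "B", "Empty"], ["A", "B", "Empty"], ["C", "D", "Empty"]]


def mismatch_count(x, y):
    # Hypothetical metric: number of positions where the two rows differ.
    return sum(a != b for a, b in zip(x, y))


def index_metric(i, j):
    # pdist passes two length-1 rows; each holds an index into all_states.
    return mismatch_count(all_states[int(i[0])], all_states[int(j[0])])


# One numeric "feature" per row: its position in all_states.
row_ids = np.arange(len(all_states)).reshape(-1, 1)
dm = squareform(pdist(row_ids, index_metric))

neigh = NearestNeighbors(n_neighbors=2, metric="precomputed").fit(dm)

# With no argument, kneighbors() returns the neighbours of the fitted rows
# themselves and leaves each row out of its own neighbour list.
distances, indices = neigh.kneighbors()
print(dm)
print(distances)
print(indices)
```

Because the first two rows are identical, each is the other's nearest neighbour at distance 0.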

Upvotes: 3

Lusthetics

Reputation: 16

As far as I know, scikit-learn models need to be fitted on numerical data. If you have a way to convert your strings to numbers before your distance metric is applied, then it will work.
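A rough sketch of that idea, assuming it is acceptable to map each string to an integer code first: OrdinalEncoder produces one code per category and column, and a callable metric then compares the codes. The metric mismatch_count is hypothetical and only counts differing positions, so the arbitrary ordering of the codes does not matter.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OrdinalEncoder

all_states = [["A", "B", "Empty"], ["A", "B", "Empty"], ["C", "D", "Empty"]]

# Each column is encoded separately; equal strings get equal codes.
encoder = OrdinalEncoder()
X = encoder.fit_transform(all_states)


def mismatch_count(x, y):
    # Hypothetical metric on the encoded rows: number of differing positions.
    return float(np.count_nonzero(x != y))


# algorithm="brute" is chosen here so the metric is not assumed to satisfy
# the requirements of a tree-based index.
neigh = NearestNeighbors(n_neighbors=3, metric=mismatch_count, algorithm="brute")
neigh.fit(X)

distances, indices = neigh.kneighbors(X[:1])
print(distances, indices)

# New rows must go through the same fitted encoder before querying.
query = encoder.transform([["C", "D", "Empty"]])
print(neigh.kneighbors(query, n_neighbors=1))
```

The codes are only labels, so a metric that treats them as magnitudes (for example Euclidean distance) would probably not be meaningful on this data.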

Upvotes: 0
