Calculate euclidean distance from dicts (sklearn)

I have two dictionaries already calculated in my code, which look like this:

X = {'a': 10, 'b': 3, 'c': 5, ...}
Y = {'a': 8, 'c': 3, 'e': 8, ...}

Actually they contain words from wiki texts, but this should serve to show what I mean. They don't necessarily contain the same keys.

Initially I wanted to use sklearn's pairwise metric like this:

from sklearn.metrics.pairwise import pairwise_distances

obama = wiki[wiki['name'] == 'Barack Obama']['tf_idf'][0]
biden = wiki[wiki['name'] == 'Joe Biden']['tf_idf'][0]

obama_biden_distance = pairwise_distances(obama, biden, metric='euclidean', n_jobs=2)[0][0]

However, this gives an error:

--------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-124-7ff03bd40683> in <module>()
      6 biden = wiki[wiki['name'] == 'Joe Biden']['tf_idf'][0]
      7 
----> 8 obama_biden_distance = pairwise_distances(obama, biden, metric='euclidean', n_jobs=2)[0][0]

/home/xiaolong/development/anaconda3/envs/coursera_ml_clustering_and_retrieval/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1205         func = partial(distance.cdist, metric=metric, **kwds)
   1206 
-> 1207     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1208 
   1209 

/home/xiaolong/development/anaconda3/envs/coursera_ml_clustering_and_retrieval/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1058     ret = Parallel(n_jobs=n_jobs, verbose=0)(
   1059         fd(X, Y[s], **kwds)
-> 1060         for s in gen_even_slices(Y.shape[0], n_jobs))
   1061 
   1062     return np.hstack(ret)

AttributeError: 'dict' object has no attribute 'shape'

To me this reads like something is trying to access the shape attribute, which a dict doesn't have. I guess it needs numpy arrays. How can I transform the dictionaries so that the sklearn function computes the correct distance, treating a key that is missing from one dictionary as having the value 0?

Upvotes: 2

Answers (3)

Tonechas

Reputation: 13743

You could start by creating a sorted list of all the keys in your dictionaries (sorting matters because it fixes the column order, so every dictionary is laid out the same way):

X = {'a': 10, 'b': 3, 'c': 5}
Y = {'a': 8, 'c': 3, 'e': 8}
data = [X, Y]
words = sorted(list(reduce(set.union, map(set, data))))

This works fine in Python 2, but in Python 3 you'll need to add the statement from functools import reduce (thanks to @Zelphir for spotting this). If you don't wish to import the functools module, you can replace the last line of the snippet above with the following code:

words = set(data[0])
for d in data[1:]:
    words = words | set(d)
words = sorted(list(words))

Whichever method you choose, the list words makes it possible to set up a matrix in which each row corresponds to a dictionary (a sample) and each value (a feature) is placed in the column corresponding to its key. Keys that a dictionary lacks are filled with 0.

# list() is required on Python 3, where zip returns a lazy iterator
feats = list(zip(*[[d.get(w, 0) for d in data] for w in words]))

This matrix can be passed to scikit-learn's pairwise_distances function:

from sklearn.metrics.pairwise import pairwise_distances as pd
dist = pd(feats, metric='euclidean')

The following interactive session demonstrates how it works:

In [227]: words
Out[227]: ['a', 'b', 'c', 'e']

In [228]: feats
Out[228]: [(10, 3, 5, 0), (8, 0, 3, 8)]

In [229]: dist
Out[229]: 
array([[ 0.,  9.],
       [ 9.,  0.]])

Finally, you could wrap the code above into a function to compute the pairwise distance of any number of dictionaries:

def my_func(data, metric='euclidean'):
    words = set(data[0])
    for d in data[1:]:
        words = words | set(d)
    words = sorted(list(words))
    feats = list(zip(*[[d.get(w, 0) for d in data] for w in words]))
    return pd(feats, metric=metric)

I have avoided the call to reduce in order for the wrapper to work across versions.

Demo:

In [237]: W = {'w': 1}

In [238]: Z = {'z': 1}

In [239]: my_func((X, Y, W, Z), 'cityblock')
Out[239]: 
array([[  0.,  15.,  19.,  19.],
       [ 15.,   0.,  20.,  20.],
       [ 19.,  20.,   0.,   2.],
       [ 19.,  20.,   2.,   0.]])
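
As a sanity check, the 'cityblock' matrix above can be reproduced without scikit-learn. This is only a sketch: cityblock and pairwise are hypothetical helper names introduced here, not library functions.

```python
X = {'a': 10, 'b': 3, 'c': 5}
Y = {'a': 8, 'c': 3, 'e': 8}
W = {'w': 1}
Z = {'z': 1}


def cityblock(u, v):
    """Manhattan distance between two dicts; missing keys count as 0."""
    return sum(abs(u.get(k, 0) - v.get(k, 0)) for k in set(u) | set(v))


def pairwise(dicts, dist):
    """Full distance matrix as a list of lists (hypothetical helper)."""
    return [[dist(u, v) for v in dicts] for u in dicts]


matrix = pairwise([X, Y, W, Z], cityblock)
for row in matrix:
    print(row)
# [0, 15, 19, 19]
# [15, 0, 20, 20]
# [19, 20, 0, 2]
# [19, 20, 2, 0]
```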

Upvotes: 3

Daniel F

Reputation: 14399

It seems you want X.get(search_string, 0), which returns the value, or 0 if the key is not found. If you have many search strings, [X.get(s, 0) for s in list_of_strings] returns a list of the results.
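
A minimal sketch of that idea, using the X and Y from the question. The name keys (the sorted union of both key sets) is introduced here for illustration and is an assumption about how the list of search strings would be built:

```python
# Example dictionaries from the question.
X = {'a': 10, 'b': 3, 'c': 5}
Y = {'a': 8, 'c': 3, 'e': 8}

# Hypothetical name: every key that occurs in either dictionary,
# sorted so both vectors use the same column order.
keys = sorted(set(X) | set(Y))

# dict.get(key, 0) fills in 0 for keys a dictionary does not have.
x_vec = [X.get(k, 0) for k in keys]
y_vec = [Y.get(k, 0) for k in keys]

print(keys)   # ['a', 'b', 'c', 'e']
print(x_vec)  # [10, 3, 5, 0]
print(y_vec)  # [8, 0, 3, 8]
```

The two lists are now aligned column by column, so they can be stacked into an array and handed to a distance function.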

Upvotes: 0

juanpa.arrivillaga

Reputation: 96267

Why don't you just do it directly from your sparse representation?

In [1]: import math

In [2]: Y = {'a': 8, 'c':3,'e':8}

In [3]: X = {'a':10, 'b':3, 'c':5}

In [4]: math.sqrt(sum((X.get(d,0) - Y.get(d,0))**2 for d in set(X) | set(Y)))
Out[4]: 9.0
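
If the expression is needed in more than one place, it can be wrapped in a small function. This is a sketch, and dict_euclidean is a hypothetical name rather than anything from a library:

```python
import math


def dict_euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts.

    A key missing from one dict is treated as having the value 0.
    """
    keys = set(u) | set(v)  # only keys that occur in at least one dict
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))


X = {'a': 10, 'b': 3, 'c': 5}
Y = {'a': 8, 'c': 3, 'e': 8}

print(dict_euclidean(X, Y))  # 9.0
print(dict_euclidean(X, X))  # 0.0
```

Because only keys present in at least one dictionary are visited, the cost scales with the number of stored entries rather than with the size of the full vocabulary.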

Upvotes: 6
