Reputation: 2618
I have a pandas DataFrame with two columns (Word and Word_Position). I need to find the distance between words and present the output in matrix form for better readability.
What I have done so far: I created a row matrix from the DF.Word_Position column and transposed it to create a column matrix. When I subtract one matrix from the other, some of the resulting values are negative.
Mathematically this is correct, of course, but for my purposes I only need the magnitude, not the sign.
Is there a better way to do this? Any help is appreciated. Thanks in advance.
Note: I am using Python 3.6.
Code snippets and their corresponding output, for reference:
m1 = np.matrix(df1['Word Position'])
print(m1)
[[ 1 2 3 ..., 19 20 21]]
m2 = np.matrix(m1.T)
print(m2)
[[ 1]
[ 2]
[ 3]
...,
[19]
[20]
[21]]
print(m2-m1)
[[ 0 -1 -2 ..., -18 -19 -20]
[ 1 0 -1 ..., -17 -18 -19]
[ 2 1 0 ..., -16 -17 -18]
...,
[ 18 17 16 ..., 0 -1 -2]
[ 19 18 17 ..., 1 0 -1]
[ 20 19 18 ..., 2 1 0]]
Upvotes: 4
Views: 17663
Reputation: 6528
If you want the distance between two arrays, the proper way is to compute the norm:
dists = [np.linalg.norm(m - m2, axis=1) for m in m1]
This assumes that both arrays have shape (n_sample, n_dimension). Instead of a list comprehension, you can also use NumPy broadcasting on m2.
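As a rough sketch of the broadcasting alternative mentioned here: the arrays `a` and `b` below are made-up sample data (not the asker's `m1` and `m2`), both laid out as (n_sample, n_dimension). The loop version and the broadcast version should produce the same distance matrix.

```python
import numpy as np

# Hypothetical sample data: 4 and 3 points in 2 dimensions,
# each array laid out as (n_sample, n_dimension).
a = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [6.0, 8.0]])
b = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])

# List-comprehension version: one row of distances per point in a.
dists_loop = np.array([np.linalg.norm(p - b, axis=1) for p in a])

# Broadcasting version: (4, 1, 2) - (1, 3, 2) -> (4, 3, 2),
# then take the Euclidean norm over the last (dimension) axis.
dists_bcast = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
```

Both results have shape (4, 3): one row per point in `a`, one column per point in `b`.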
If you want more control over the metric, you might want to use scipy.spatial.distance.cdist. This option is faster with large arrays. An example with the Minkowski distance (p=2 gives the Euclidean distance):
dists = scipy.spatial.distance.cdist(m1, m2, 'minkowski', p=2)
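For the word-position case, a minimal sketch of how cdist might be called. The `positions` array is hypothetical sample data standing in for the Word Position column, reshaped to (n_sample, 1) because cdist expects 2-D input. With one dimension, every Minkowski order gives the plain absolute difference.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical stand-in for the word positions, shaped (n_sample, 1).
positions = np.arange(1, 6).reshape(-1, 1)

# Minkowski distance with p=2 (Euclidean); p is passed as a keyword argument.
euclid = cdist(positions, positions, 'minkowski', p=2)

# 'cityblock' is the p=1 case; for 1-D data the result is the same.
manhattan = cdist(positions, positions, 'cityblock')
```

Swapping the metric name is the only change needed to try a different distance, which is the main reason to reach for cdist over a hand-written subtraction.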
Of course, if the array is only 1D you can achieve that using an absolute value:
dists = np.abs(m1 - m2)
Upvotes: 1
Reputation: 14399
In this case, you probably want to use scipy.spatial.distance.pdist
from scipy.spatial.distance import squareform, pdist
m = df1['Word Position'].values[:, None]
dist = squareform(pdist(m, 'minkowski', p=1))
A bit overkill for this, but extensible if you ever want to change your distance parameter, and usually faster than broadcasting (since it only does half the subtraction steps, as abs(a-b) == abs(b-a)). If you want to do broadcasting, you could always do this:
dist = np.abs(m - m.T)
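To illustrate the "half the subtraction steps" point, here is a small sketch comparing the two approaches. The `pos` array is made-up data (positions 1 to 21, matching the size in the question), not read from the asker's DataFrame.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical positions 1..21 as a column vector of shape (21, 1).
pos = np.arange(1, 22, dtype=float)[:, None]

# pdist returns only the upper triangle in condensed form:
# n * (n - 1) / 2 values instead of n * n.
condensed = pdist(pos, 'minkowski', p=1)

# squareform expands the condensed vector into the full symmetric matrix.
dist_pdist = squareform(condensed)

# Broadcasting computes every pair in both directions.
dist_bcast = np.abs(pos - pos.T)
```

For 21 positions the condensed vector holds 210 values, while the broadcast result computes all 441 entries; both end up as the same 21 x 21 matrix.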
Upvotes: 1
Reputation: 109546
Just take the absolute value?
np.abs(m2 - m1)
Your code indicates that your data consists of numpy arrays, so the solution above should work.
If they are dataframes, you could do:
m2.sub(m1).abs()
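Since the question asks for a readable matrix, one possible way to present the result is to wrap the absolute differences in a DataFrame labelled by word. This is only a sketch: the `df1` below is a made-up four-row stand-in with the column names used in the question's code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's df1.
df1 = pd.DataFrame({
    'Word': ['the', 'quick', 'brown', 'fox'],
    'Word Position': [1, 2, 3, 4],
})

pos = df1['Word Position'].to_numpy()

# Column vector minus row vector broadcasts to an n x n matrix;
# abs() drops the sign so only the distance remains.
dist = pd.DataFrame(
    np.abs(pos[:, None] - pos[None, :]),
    index=df1['Word'],
    columns=df1['Word'],
)
```

A lookup such as `dist.loc['the', 'fox']` then returns the distance between two words directly. Note that label-based lookup like this assumes the words are unique.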
Upvotes: 5