Shakeel Mumtaz

Reputation: 11

Pandas: matrix calculation on values

I have a DataFrame like this:

       apple  aple  apply
apple      0     0      0
aple       0     0      0
apply      0     0      0

I want to calculate the string distance between each pair of labels, e.g. apple -> aple. My desired result is:

       apple  aple  apply
apple      0    32     14
aple      32     0     30
apply     14    30      0

This is the code I am currently using, but it is very slow on large data:

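from simhash import Simhash  # assumes the simhash package

columns = df.columns
for r in columns:
    for c in columns:
        # Row r, column c: Simhash distance between the two labels
        df.loc[r, c] = Simhash(r).distance(Simhash(c))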

Can anyone help me calculate the distances more efficiently?

Upvotes: 1

Views: 350

Answers (1)

chrisb

Reputation: 52286

One thought: since the output is symmetric, iterating over every ordered pair computes each distance twice. You can also skip comparing an element with itself, since that distance is always 0. So, to at least cut down the number of calculations, you could use itertools to compute the distance once per unordered pair, and then use pandas to fill in the rest:

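from itertools import combinations
from collections import defaultdict

import pandas as pd
from simhash import Simhash  # assumes the simhash package

data = df.index

output = defaultdict(dict)

# Compute each unordered pair once
for a, b in combinations(data, 2):
    output[a][b] = Simhash(a).distance(Simhash(b))
# The distance from a label to itself is always 0
for a in data:
    output[a][a] = 0

df = pd.DataFrame(output)

# Fill the missing half of the matrix from its transpose
df = df.fillna(df.T)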

You'd have to test it on a bigger frame, but I think it would be faster than what you're doing, since it makes roughly half as many distance calls, and it should give the same answer.

In [84]: df
Out[84]: 
       aple  apple  apply
aple      0     32     30
apple    32      0     14
apply    30     14      0
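
A further saving that may be worth testing: the loops above build a new Simhash object for both strings on every comparison, so each label gets hashed many times. The sketch below hashes each label once, caches the result, and then compares the cached fingerprints. To keep it runnable without any third-party package, it uses a hypothetical stand-in, `toy_fingerprint`, in place of Simhash, so its numbers will not match the table above; `hamming` and `distance_table` are likewise illustrative names, not part of any library.

```python
from itertools import combinations
import hashlib


def toy_fingerprint(text):
    """Hypothetical stand-in for a Simhash fingerprint: a 64-bit integer."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hamming(x, y):
    """Number of differing bits between two integer fingerprints."""
    return bin(x ^ y).count("1")


def distance_table(words, fingerprint=toy_fingerprint):
    # Hash every word exactly once instead of once per comparison.
    prints = {w: fingerprint(w) for w in words}
    # The distance from a word to itself is always 0.
    table = {w: {w: 0} for w in words}
    # Visit each unordered pair once and mirror the result.
    for a, b in combinations(words, 2):
        d = hamming(prints[a], prints[b])
        table[a][b] = d
        table[b][a] = d
    return table


words = ["apple", "aple", "apply"]
table = distance_table(words)
```

With the real library, the same idea would be to build `hashes = {w: Simhash(w) for w in data}` once and call `hashes[a].distance(hashes[b])` inside the pair loop. The resulting dict of dicts can be passed to `pd.DataFrame` just like `output` above, and because both halves are already filled in, the `fillna` step is not needed.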

Upvotes: 1
