Reputation: 11
I have a DataFrame like this:
       apple  aple  apply
apple      0     0      0
aple       0     0      0
apply      0     0      0
I want to calculate the string distance between every pair of labels, e.g. apple -> aple. My expected result is:
       apple  aple  apply
apple      0    32     14
aple      32     0     30
apply     14    30      0
This is the code I am currently using, but it is very slow on large data:
from simhash import Simhash

columns = df.columns
for r in columns:
    for c in columns:
        m[r][c] = Simhash(r).distance(Simhash(c))
Can anyone help me calculate these distances more efficiently?
Upvotes: 1
Views: 350
Reputation: 52286
One thought: since the output is symmetric, iterating over every ordered pair computes each distance twice. You can also skip the comparison between an element and itself. So, to at least cut down on the number of calculations, you could do something like this: use itertools to compute the distance once per unordered pair, then use pandas to fill in the rest.
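from itertools import combinations
from collections import defaultdict

import pandas as pd
from simhash import Simhash

data = df.index
output = defaultdict(dict)
# Compute each unordered pair only once
for a, b in combinations(data, 2):
    output[a][b] = Simhash(a).distance(Simhash(b))
# The distance from an element to itself is zero
for a in data:
    output[a][a] = 0
df = pd.DataFrame(output)
# Fill the missing half of the matrix from its transpose
df = df.fillna(df.T)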
You'd have to test on a bigger frame, but I think it would be faster than what you're doing, and should give the same answer.
In [84]: df
Out[84]:
       aple  apple  apply
aple      0     32     30
apple    32      0     14
apply    30     14      0
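A further saving that might be worth trying: the loops above build a new Simhash object for both strings on every pair, so each string gets hashed many times. Here is a minimal sketch that computes each fingerprint once and then fills both halves of the matrix directly. To keep it runnable without the simhash package, `toy_fingerprint` and `hamming` are hypothetical stand-ins I made up (they are not the Simhash algorithm), so the numbers will differ from the ones shown above.

```python
from itertools import combinations
import hashlib


def toy_fingerprint(text):
    """Hypothetical stand-in for a Simhash fingerprint: a 64-bit integer."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hamming(a, b):
    """Number of bits that differ between two integers."""
    return bin(a ^ b).count("1")


def distance_matrix(labels, fingerprint=toy_fingerprint, distance=hamming):
    # Compute each fingerprint once, instead of once per pair.
    prints = {label: fingerprint(label) for label in labels}
    # Start with zeros on the diagonal.
    matrix = {a: {a: 0} for a in labels}
    # Visit each unordered pair once and mirror the result.
    for a, b in combinations(labels, 2):
        d = distance(prints[a], prints[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


labels = ["apple", "aple", "apply"]
matrix = distance_matrix(labels)
```

With the real library you would presumably pass `fingerprint=Simhash` and `distance=lambda x, y: x.distance(y)`, going by how `distance` is called in the question. The nested dict can be handed to `pd.DataFrame(matrix)`, and since both halves are already filled, the `fillna` step is not needed.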
Upvotes: 1