python & pandas: iterating over DataFrame twice

Question

Doing a mahalanobis calculation for each row of a DataFrame with distances to every other row in the DataFrame. It kind of looks like this:

import pandas as pd
from scipy import linalg
from scipy.spatial.distance import mahalanobis
from pprint import pprint

testa = { 'pid': 'testa', 'a': 25, 'b': .455, 'c': .375 }
testb = { 'pid': 'testb', 'a': 22, 'b': .422, 'c': .402 }
testc = { 'pid': 'testc', 'a': 11, 'b': .389, 'c': .391 }

cats = ['a','b','c']
pids = pd.DataFrame([ testa, testb, testc ])
inverse = linalg.inv(pids[cats].cov().values)
distances = { pid: {} for pid in pids['pid'].tolist() }

for i, p in pids.iterrows():
    pid = p['pid']
    others = pids.loc[pids['pid'] != pid]
    for x, other in others.iterrows():
        otherpid = other['pid']
        d = mahalanobis(p[cats], other[cats], inverse) ** 2
        distances[pid][otherpid] = d

pprint(distances)

It works fine for the three test cases here, but in real life it's going to run against around 2000-3000 rows, and using this approach takes too long. I'm relatively new to pandas and I really prefer python to R, so I'd like to have this cleaned up.

How can I make this more efficient?

python & pandas: iterating over DataFrame twice

Answers (1)

Related Questions

python &amp; pandas: iterating over DataFrame twice

Answers (1)

Related Questions

python & pandas: iterating over DataFrame twice