theupandup

Reputation: 335

Efficient pairwise calculations with pandas

Given some rows of categorical data, I want to calculate a pairwise matrix with the number of differences between those rows.

For example, comparing a row with the values [1, 0, 0, 1] to a row with values [0, 0, 1, 1] would give a resulting value of 2, because indices 0 and 2 differ.

I want to make a matrix showing each pairwise combination of rows. I wrote code for this, but it's very inefficient on large data. I know there must be a way to do this more efficiently, because only the top half of this matrix really needs to be calculated.

I don't know how to translate it to code, though. Here is what I have so far:

import pandas as pd

shortened = pd.DataFrame(
    [{'c1': 1, 'c2': 0, 'c3': 0}, {'c1': 1, 'c2': 1, 'c3': 0}, {'c1': 0, 'c2': 0, 'c3': 1}]
)
# Header row: an empty corner cell followed by the row indices as column labels.
distm = [[""] + list(shortened.index)]
# Cache of already-computed distances, keyed by (row index, row index).
found = {}
for index, row in shortened.iterrows():
    newrow = [index]
    for i2, r2 in shortened.iterrows():
        # Reuse the mirrored value if this pair was already compared.
        if (i2, index) in found:
            newrow.append(found[(i2, index)])
            continue
        # A row compared with itself always has distance 0.
        if index == i2:
            newrow.append(0)
            continue
        # Count the positions where the two rows differ.
        summeddif = sum(i != j for i, j in zip(row, r2))
        newrow.append(summeddif)
        found[(index, i2)] = summeddif
    distm.append(newrow)

So with the example dataframe here, the correct output is obtained:

  | 0 1 2
---------
0 | 0 1 2
1 | 1 0 3
2 | 2 3 0

However, with a very large input this takes forever. Is there an elegant way to iterate over only the top half and simply copy it over to the bottom half, so I don't need to make unnecessary comparisons? Or is there no other way to improve this through pandas?

Upvotes: 2

Views: 1061

Answers (1)

cs95

Reputation: 402483

Use broadcasted XOR.

(shortened.values ^ shortened.values[:, None]).sum(2)

array([[0, 1, 2],
       [1, 0, 3],
       [2, 3, 0]])

XOR is the easiest (and fastest) way of checking whether two bits differ. This should work as long as your input is binary.
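
If the columns can hold more than two categories, the same broadcasting idea still applies with != in place of ^. Here is a minimal sketch of that variant (not part of the answer above), which also shows how the result could be wrapped back into a labelled DataFrame:

import pandas as pd

shortened = pd.DataFrame(
    [{'c1': 1, 'c2': 0, 'c3': 0}, {'c1': 1, 'c2': 1, 'c3': 0}, {'c1': 0, 'c2': 0, 'c3': 1}]
)

# Compare every row with every other row; != works for arbitrary categorical
# values, not just 0/1, and summing over the last axis counts mismatching columns.
vals = shortened.to_numpy()
dist = (vals != vals[:, None]).sum(axis=2)

# Optionally put the original index back on both axes.
dist_df = pd.DataFrame(dist, index=shortened.index, columns=shortened.index)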

Note that this is memory intensive, especially for very large frames, with a chance of OOM at ~1M rows.
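
If memory becomes the bottleneck, one alternative (my own sketch, not part of the answer above, and assuming the values are numeric codes) is to let SciPy compute only the condensed upper triangle with a Hamming metric, which avoids materialising the full broadcast intermediate:

import pandas as pd
from scipy.spatial.distance import pdist, squareform

shortened = pd.DataFrame(
    [{'c1': 1, 'c2': 0, 'c3': 0}, {'c1': 1, 'c2': 1, 'c3': 0}, {'c1': 0, 'c2': 0, 'c3': 1}]
)

# pdist with metric='hamming' returns, for each pair of rows, the fraction of
# columns that differ (in condensed upper-triangle form). Multiplying by the
# column count turns that back into a raw difference count, and squareform
# expands it to the full symmetric matrix.
n_cols = shortened.shape[1]
condensed = pdist(shortened.to_numpy(), metric='hamming') * n_cols
dist = squareform(condensed).round().astype(int)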

Upvotes: 2
