Reputation: 738
I have an Excel file with 100 columns, each with 1000 entries. Each entry can only take one of three values (0.8, 0.0, or 0.37). I want to count the number of mismatches between every combination of two columns' entries.
For example, the excel sheet below shows the mismatches between the columns:
| Column 1 | Column 2 | Column 3 | Mismatch |
|----------|----------|----------|----------|
| 0.37     | 0.8      | 0.0      | 3        |
| 0.0      | 0.0      | 0.8      | 2        |
First we compare column 1 against column 2. Since the first rows differ, we add 1 to the corresponding row of the mismatch column. We repeat this for column 1 vs. column 3 and then column 2 vs. column 3, so we need to iterate over every unique combination of two columns.
The brute-force way of doing this is a nested loop that compares two columns at a time. I was wondering if there is a more pandas-idiomatic way of doing this.
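For concreteness, here is a minimal sketch of that brute-force nested loop on a toy frame matching the example table above (the column names are illustrative):

```python
import pandas as pd

# Toy frame reproducing the example table above
df = pd.DataFrame({'Column 1': [0.37, 0.0],
                   'Column 2': [0.8, 0.0],
                   'Column 3': [0.0, 0.8]})

# Nested loop over every unique pair of columns, accumulating
# a per-row mismatch count
mismatch = pd.Series(0, index=df.index)
cols = df.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        mismatch += (df[cols[i]] != df[cols[j]]).astype(int)

print(mismatch.tolist())  # [3, 2]
```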
Upvotes: 1
Views: 641
Reputation: 59529
Since you sum pairwise combinations, it's the same as checking the first column against the second through the last columns, the second against the third through the last, and so on. Checking N-1 equalities against the DataFrame (N = number of columns) and summing will be quite a bit faster than checking N choose 2 individual column pairings, especially with your large number of columns:
from functools import reduce
reduce(lambda x,y: x+y, [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1)
for i in range(len(df.columns)-1)])
0 3
1 2
dtype: int64
Some timings with your data size:
import numpy as np
import pandas as pd
from itertools import combinations
np.random.seed(123)
df = pd.DataFrame(np.random.choice([0, 0.8, 0.37], (1000,100)))
%timeit reduce(lambda x, y: x+y, [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1) for i in range(len(df.columns)-1)])
#157 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.concat([df[x[0]]!=df[x[1]] for x in combinations(df.columns, 2)], axis=1).sum(1)
#1.55 s ± 9.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can gain slightly by using numpy and summing the values, though you lose the index:
%timeit np.sum([df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1).to_numpy() for i in range(len(df.columns)-1)], axis=0)
#139 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
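If memory permits, a fully vectorized alternative (a sketch, not part of the timings above) compares every pair of columns at once with NumPy broadcasting. The intermediate boolean array has shape (rows, cols, cols) — about 10 MB at this size — and is symmetric with a False diagonal, so halving its sum gives the per-row count of mismatched pairs:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.choice([0, 0.8, 0.37], (1000, 100)))

arr = df.to_numpy()
# neq[r, i, j] is True when columns i and j disagree in row r.
# The array is symmetric with a False diagonal, so summing over
# both column axes counts every unordered pair exactly twice.
neq = arr[:, :, None] != arr[:, None, :]
mismatch = pd.Series(neq.sum(axis=(1, 2)) // 2, index=df.index)
```

Memory grows as rows × cols², so this trades memory for speed and only pays off while the intermediate array fits comfortably in RAM.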
Upvotes: 1
Reputation: 323226
This is how I would handle this problem:
from itertools import combinations
L = df.columns.tolist()
pd.concat([df[x[0]]!=df[x[1]] for x in combinations(L, 2)], axis=1).sum(1)
0 3
1 2
dtype: int64
Upvotes: 3