Reputation: 738
I have an Excel file with 100 columns, each with 1000 entries. Each entry can only take one of three values (0.8, 0.0, or 0.37). I want to count the number of mismatches between every combination of two columns' entries.
For example, the excel sheet below shows the mismatches between the columns:
| Column 1 | Column 2 | Column 3 | Mismatch |
|----------|----------|----------|----------|
| 0.37     | 0.8      | 0.0      | 3        |
| 0.0      | 0.0      | 0.8      | 2        |
First we compare column 1 against column 2. Since the first rows differ, we add 1 to the corresponding row of the mismatch column. We repeat this for column 1 vs. column 3 and then column 2 vs. column 3, so we need to iterate over every unique combination of two columns.
The brute-force way of doing this is a nested loop that compares two columns at a time. I was wondering if there is a more pandas-idiomatic way of doing this.
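For concreteness, here is a minimal sketch of that brute-force nested loop on a toy frame matching the example table above (the column names are illustrative):

```python
import pandas as pd

# Toy frame reproducing the example table above
df = pd.DataFrame({'Column 1': [0.37, 0.0],
                   'Column 2': [0.8, 0.0],
                   'Column 3': [0.0, 0.8]})

# Nested loop over every unique pair of columns, accumulating
# a per-row mismatch count
mismatch = pd.Series(0, index=df.index)
cols = df.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        mismatch += (df[cols[i]] != df[cols[j]]).astype(int)

print(mismatch.tolist())  # [3, 2]
```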
Upvotes: 1
Views: 641
Reputation: 59529
Since you sum pairwise combinations, it's the same as checking the first column against the second through the last columns, the second against the third through the last, and so on. Checking N-1 equalities against the DataFrame (N = number of columns) and summing will be quite a bit faster than checking N choose 2 individual column pairings, especially with your large number of columns:
from functools import reduce
reduce(lambda x,y: x+y, [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1)
for i in range(len(df.columns)-1)])
0 3
1 2
dtype: int64
Some timings with your data size:
import numpy as np
import pandas as pd
from itertools import combinations
np.random.seed(123)
df = pd.DataFrame(np.random.choice([0, 0.8, 0.37], (1000,100)))
%timeit reduce(lambda x, y: x+y, [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1) for i in range(len(df.columns)-1)])
#157 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.concat([df[x[0]]!=df[x[1]] for x in combinations(df.columns, 2)], axis=1).sum(1)
#1.55 s ± 9.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can gain slightly by using numpy and summing the values, though you lose the index:
%timeit np.sum([df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1).to_numpy() for i in range(len(df.columns)-1)], axis=0)
#139 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
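If memory permits, a fully vectorized alternative (a sketch, not part of the timings above) compares every pair of columns at once with NumPy broadcasting. The intermediate boolean array has shape (rows, cols, cols) — about 10 MB at this size — and is symmetric with a False diagonal, so halving its sum gives the per-row count of mismatched pairs:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.choice([0, 0.8, 0.37], (1000, 100)))

arr = df.to_numpy()
# neq[r, i, j] is True when columns i and j disagree in row r.
# The array is symmetric with a False diagonal, so summing over
# both column axes counts every unordered pair exactly twice.
neq = arr[:, :, None] != arr[:, None, :]
mismatch = pd.Series(neq.sum(axis=(1, 2)) // 2, index=df.index)
```

Memory grows as rows × cols², so this trades memory for speed and only pays off while the intermediate array fits comfortably in RAM.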
Upvotes: 1
Reputation: 323226
This is how I would handle this problem:
from itertools import combinations
L = df.columns.tolist()
pd.concat([df[x[0]]!=df[x[1]] for x in combinations(L, 2)], axis=1).sum(1)
0 3
1 2
dtype: int64
Upvotes: 3