Reputation: 5498
The following code allows me to find matching values:
from itertools import combinations
df = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
df['matches'] = sum(abs(df[k1]-df[k2])<thresh for k1,k2 in combinations(df.keys(),2))
Info about this can be found here:
Better explanation is from @Andras Deak
"The generator expression in the sum()
loops over every column pair, and constructs the respective boolean vector. These are summed for each column pair, and the resulting column is appended to the dataframe".
Example output for thresh = 0.3:
A B C matches
0 0.146360 -0.099707 0.633632 1
1 1.462810 -0.186317 -1.411988 0
2 0.358827 -0.758619 0.038329 0
3 0.077122 -0.213856 -0.619768 1
4 0.215555 1.930888 -0.488517 0
5 -0.946557 -0.904743 -0.004738 1
6 -0.080209 -0.850830 -0.866865 1
7 -0.997710 -0.580679 -2.231168 0
8 1.762313 -0.356464 -1.813028 0
9 1.151338 0.347636 -1.323791 0
10 0.248432 1.265484 0.048484 1
11 0.559934 -0.401059 0.863616 0
How can I return another column df['matches_mean']
providing the mean of the values that have matched? So for the first row in the example above it would return the mean of 0.146360
& -0.099707
. I would like this to use the same itertools combinations
logic as the original code as this scales well on my real data.
Upvotes: 0
Views: 90
Reputation: 35125
For the time being here is an almost-perfect solution. The problem is that when you want to calculate the mean for multiple matches (say, all three numbers in a row are close within threshold), you need to take the proper number of values during computing the mean. This is easy in case if there are only 1 or 3 matches out of 3 columns. But when two pairs of numbers are "close" but the third pair is not, the below code will have an error.
The idea is to sum up the matching values for each row and each combination. If only, say, row 'A'
and 'B'
are "close" (within threshold), we get df.A+df.B
, which has to be divided by 2 to get the mean. When all three are "close", we get 2*df.A + 2*df.B + 2*df.C
, which can be divided by 6 to get a proper mean. However, in the third case we get, for instance, df.A + 2*df.B + df.C
(when A and B are close, B and C are close, but A and C are not). In this case, we can't divide by anything to get the proper mean. We should probably divide by 4 to get "roughly one" element's worth of erroneous mean. My point is that the code would be much more complicated if we were to treat this case properly, and depending on your needs it might not be worth it. It's also unclear how exactly you want to handle this case. The current version of dividing the above by 4 is equivalent to averaging A vs B, averaging B vs C, then averaging these average values again.
So here goes:
import numpy as np
import pandas as pd
from itertools import combinations
colnames = ['A', 'B', 'C']
df = pd.DataFrame(np.random.randn(12, 3), columns=colnames)
thresh = .3
df['matches'] = sum(abs(df[k1]-df[k2])<thresh
for k1,k2 in combinations(colnames,2))
# this is your starting point, we'll need df['matches'] too
tmpsums = sum(np.where(abs(df[k1]-df[k2])<thresh,df[k1]+df[k2],0)
for k1,k2 in combinations(colnames,2))
# divide by 2/4/6:
df['matches_mean'] = np.where(df['matches'],tmpsums/df['matches']/2,0)
The appearing sum over a generator expression is reaching the limits of good taste, I admit. You might want to write it out in a proper for loop instead, but then you'll have to incrementally sum up the values in tmpsums
. I admit that this might be prettier.
Anyway, this second generator expression works similarly to the first one. The yielded value is of course different, it is
np.where(abs(df[k1]-df[k2])<thresh,df[k1]+df[k2],0)
That is, it will give us the sum of the elements for the given column pair if those values are closer than thresh, otherwise we get 0. For all 3 combinations we get an array with such zero or sum-of-two-elements values, and we sum these up again. Where there were 0 matches, we get 0. Where there was 1 match, we sum up the two matching elements. For 2 matches, we get the mixed sum that I mentioned earlier, and we have all terms twice in case of 3 matches.
What remains is to divide the nonzero cases by the number of matches, which is simply a division with twice the number of matches which we already know (but we have to watch out for division-by-zero).
Example output with thresh = 0.3
:
A B C matches matches_mean
0 0.716278 0.681279 0.861410 3 0.752989
1 -0.109029 -0.646952 0.268038 0 0.000000
2 -1.095221 -1.088397 1.100645 1 -1.091809
3 -1.970372 -0.367096 -0.337098 1 -0.352097
4 -1.030003 0.082001 -0.807431 1 -0.918717
5 1.660611 -0.046429 0.557107 0 0.000000
6 -0.508715 -0.588217 0.014917 1 -0.548466
7 0.578028 -0.187097 -0.420243 1 -0.303670
8 0.233687 1.311917 1.888947 0 0.000000
9 0.478863 1.087957 -0.897025 0 0.000000
10 -0.001462 0.866320 -1.198642 0 0.000000
11 0.297946 0.564325 -1.098887 1 0.431135
Upvotes: 1