Reputation: 69
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': [3,4,5,8,10,12,14,12]})
df.groupby(['A','B']).sum()
How to find if the values in the C column is repeated in other groups as well? (Here 12 is repeated in both the groups)
Upvotes: 4
Views: 121
Reputation: 862481
Idea is convert MultIndex
to 3 columns DataFrame
, then DataFrame.pivot
with remove non duplicated rows by DataFrame.dropna
and common values are in index:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': [3,4,5,8,10,12,14,12]})
df = df.groupby(['A','B']).sum()
common = df.reset_index().pivot('C','A','B').dropna().index
print (common)
Int64Index([12], dtype='int64', name='C')
Then if want filter original data use boolean indexing
:
df = df[df['C'].isin(common)]
print (df)
C
A B
bar two 12
foo three 12
If want common rows duplicated at least in 2 groups solution is:
print (df)
A B C
0 foo one 3
1 bar one 4
2 foo two 3
3 bar three 8
4 foo two 14
5 bar two 12
6 foo one 14
7 foo three 12
8 xxx yyy 8
df = df.groupby(['A','B']).sum()
print (df)
C
A B
bar one 4
three 8 <- dupe per bar, three
two 12 <- dupe per bar, two
foo one 17 <-17 is duplicated per group foo, one, so omited
three 12 <- dupe per foo, three
two 17 <-17 is duplicated per group foo, one, so omited
xxx yyy 8 <- dupe per xxx, yyy
common1 = (df.reset_index()
.pivot_table(index='C',columns='A', values='B', aggfunc='size')
.notna()
.sum(axis=1)
)
common1 = common1.index[common1.gt(1)]
print (common1)
Int64Index([8, 12], dtype='int64', name='C')
df1 = df[df['C'].isin(common1)]
print (df1)
C
A B
bar three 8
two 12
foo three 12
xxx yyy 8
Upvotes: 2
Reputation: 30971
To show a more instructive example, I added one row to your source DataFrame, so that it contains:
A B C
0 foo one 3
1 bar one 4
2 foo two 5
3 bar three 8
4 foo two 10
5 bar two 12
6 foo one 14
7 foo three 12
8 xxx yyy 8
I saved the result of grouping in another DataFrame:
df2 = df.groupby(['A','B']).sum()
so it contains:
C
A B
bar one 4
three 8
two 12
foo one 17
three 12
two 15
xxx yyy 8
So as you can see, there are two repeted values in C: 12 and 8. Note that now the index in df2 is unique.
Then, to show the repeating values and their groups, run:
df2[df2.duplicated(keep=False)].sort_values('C')
getting:
C
A B
bar three 8
xxx yyy 8
bar two 12
foo three 12
The above result shows all repeated values and groups (A and B) in which they are.
Upvotes: 2