Reputation: 3711
Assume a dataframe df like the following:
col1 col2
0 a A
1 b A
2 c A
3 c B
4 a B
5 b B
6 a C
7 a C
8 c C
I would like to find those values of col2 where there are duplicate a entries in col1. In this example the result should be ['C'], since for df['col2'] == 'C', col1 contains two 'a' entries.
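For reproducibility, the example frame can be constructed like this (a sketch reconstructing the table above):

```python
import pandas as pd

# Reconstruct the example frame from the question.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "c", "a", "b", "a", "a", "c"],
    "col2": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

# Desired result: col2 values whose block contains 'a' more than once.
expected = ["C"]
```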
I tried this approach

df[(df['col1'] == 'a') & (df['col2'].duplicated())]['col2'].to_list()

but this only works if the a within a block of rows defined by col2 is at the beginning or end of that block, depending on how the keep keyword of duplicated() is set. In this example it returns ['B', 'C'], which is not what I want.
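To make the failure concrete, here is a minimal reproduction (assuming the frame from the question): duplicated() marks repeats of col2 across the whole frame rather than counting 'a' entries within a block, so the single 'a' in block B slips through.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "c", "c", "a", "b", "a", "a", "c"],
    "col2": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

# With keep='first' (the default), every repeated col2 value except the
# first row of its block is flagged, regardless of what col1 holds there.
out = df[(df['col1'] == 'a') & (df['col2'].duplicated())]['col2'].to_list()
print(out)  # ['B', 'C'] -- 'B' is a false positive
```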
Upvotes: 2
Views: 74
Reputation: 34056
A more generalised solution using GroupBy.count and Index.get_level_values:
In [2632]: x = df.groupby(['col1', 'col2']).col2.count().to_frame()
In [2642]: res = x[x.col2 > 1].index.get_level_values(1).tolist()
In [2643]: res
Out[2643]: ['C']
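The same idea as a self-contained script, using GroupBy.size in place of count (a sketch; being general, it would also report col2 values where some other col1 value repeats, not only 'a'):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "c", "c", "a", "b", "a", "a", "c"],
    "col2": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

# Count rows per (col1, col2) pair, keep pairs occurring more than once,
# then pull col2 out of the second index level.
counts = df.groupby(['col1', 'col2']).size()
res = counts[counts > 1].index.get_level_values(1).tolist()
print(res)  # ['C']
```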
Upvotes: 1
Reputation: 862671
Use Series.duplicated
only for filtered rows:
df1 = df[df['col1'] == 'a']
out = df1.loc[df1['col2'].duplicated(keep=False), 'col2'].unique().tolist()
print (out)
['C']
Another idea is to use DataFrame.duplicated on both columns and chain it with rows matching only a:
out = df.loc[df.duplicated(subset=['col1', 'col2'], keep=False) &
(df['col1'] == 'a'), 'col2'].unique().tolist()
print (out)
['C']
Upvotes: 2
Reputation: 6914
You can group your col1 by col2 and count occurrences of 'a':
>>> s = df.col1.groupby(df.col2).sum().str.count('a').gt(1)
>>> s[s].index.values
array(['C'], dtype=object)
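A self-contained version of the same approach: the per-group sum concatenates the strings of col1 (e.g. block C becomes 'aac'), so str.count('a') counts occurrences per block.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "c", "c", "a", "b", "a", "a", "c"],
    "col2": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

# sum() on string groups concatenates: A -> 'abc', B -> 'cab', C -> 'aac'.
s = df.col1.groupby(df.col2).sum().str.count('a').gt(1)
print(s[s].index.tolist())  # ['C']
```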
Upvotes: 1