gehbiszumeis

Reputation: 3711

Pandas - Find duplicated entries in one column within rows with equal values in another column

Assume a dataframe df like the following:

  col1 col2
0    a    A
1    b    A
2    c    A
3    c    B
4    a    B
5    b    B
6    a    C
7    a    C
8    c    C
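For reference, the frame shown above can be rebuilt with (a minimal sketch):

```python
import pandas as pd

# Reconstruct the example frame from the question
df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'c', 'a', 'b', 'a', 'a', 'c'],
    'col2': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
})
print(df)
```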

I would like to find those values of col2 where there are duplicate entries a in col1. In this example the result should be ['C'], since for df['col2'] == 'C', col1 has two a as entries.

I tried this approach

df[(df['col1'] == 'a') & (df['col2'].duplicated())]['col2'].to_list()

but this only works if the a within a block of rows defined by col2 sits at the beginning or end of the block, depending on how the keep keyword of duplicated() is set. In this example, it returns ['B', 'C'], which is not what I want.
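To illustrate the problem (a sketch, rebuilding the example frame inline): duplicated() flags repeats of col2 over the whole column, independently of where the a rows fall, so blocks containing only a single a slip into the result:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'c', 'a', 'b', 'a', 'a', 'c'],
    'col2': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
})

# keep='first' marks every col2 repeat after the first occurrence;
# keep='last' marks every repeat before the last one. Neither mask
# knows anything about how many 'a' rows a block contains.
first = df[(df['col1'] == 'a') & (df['col2'].duplicated(keep='first'))]['col2'].to_list()
last = df[(df['col1'] == 'a') & (df['col2'].duplicated(keep='last'))]['col2'].to_list()
print(first)  # ['B', 'C']
print(last)   # ['A', 'B', 'C', 'C']
```

Either way, blocks B and A leak in even though they hold only one a each.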

Upvotes: 2

Views: 74

Answers (3)

Mayank Porwal

Reputation: 34056

A more generalised solution using Groupby.count and index.get_level_values:

In [2632]: x = df.groupby(['col1', 'col2']).col2.count().to_frame()
In [2642]: res = x[x.col2 > 1].index.get_level_values(1).tolist()

In [2643]: res
Out[2643]: ['C']

Upvotes: 1

jezrael

Reputation: 862671

Use Series.duplicated only for filtered rows:

df1 = df[df['col1'] == 'a']

out = df1.loc[df1['col2'].duplicated(keep=False), 'col2'].unique().tolist()
print (out)
['C']

Another idea is to use DataFrame.duplicated on both columns and chain it with the rows matching only a:

out = df.loc[df.duplicated(subset=['col1', 'col2'], keep=False) & 
             (df['col1'] == 'a'), 'col2'].unique().tolist()
print (out)
['C']

Upvotes: 2

taras

Reputation: 6914

You can group your col1 by col2 and count the occurrences of 'a' per group (sum concatenates the strings within each group):

>>> s = df.col1.groupby(df.col2).sum().str.count('a').gt(1)
>>> s[s].index.values
array(['C'], dtype=object)

Upvotes: 1
