sshr

Reputation: 155

Finding duplicate rows python

I have timestamp and id variables in my dataframe (df)

timestamp                    id
2016-06-09 8:33:37           a1  
2016-06-09 8:33:37           a1  
2016-06-09 8:33:38           a1
2016-06-09 8:33:39           a1
2016-06-09 8:33:39           a1
2016-06-09 8:33:37           b1
2016-06-09 8:33:38           b1

No id should have two identical timestamps, so I have to print the duplicated timestamps for each id. In the case above, the output should be rows 1, 2, 4 and 5.

The following code gives the duplicate timestamps (comparing values directly, since a pandas Series has no value-counting count(x) like a list does):

set(x for x in df['timestamp'] if (df['timestamp'] == x).sum() > 1)

How to consider id along with timestamp to have the duplicate rows?

Upvotes: 2

Views: 6409

Answers (2)

Little Bobby Tables

Reputation: 4744

If you wish to find all duplicates, use the duplicated method. It only works on the columns; df.index.duplicated, on the other hand, works on the index. Therefore we first do a quick reset_index to bring the index into the columns.

df = df.reset_index()
df.loc[df.duplicated(keep=False)]

    index               id
0   2016-06-09 8:33:37  a1
1   2016-06-09 8:33:37  a1
3   2016-06-09 8:33:39  a1
4   2016-06-09 8:33:39  a1

If you just wish to remove duplicates there is the DataFrame method drop_duplicates.

df = df.reset_index()
df = df.drop_duplicates() # keep='first' by default.

    index               id
0   2016-06-09 8:33:37  a1
2   2016-06-09 8:33:38  a1
3   2016-06-09 8:33:39  a1
5   2016-06-09 8:33:37  b1
6   2016-06-09 8:33:38  b1

If you wish to get your old index back after either of the above, call set_index with the default column name 'index':

df.set_index('index')

                    id
index               
2016-06-09 8:33:37  a1
2016-06-09 8:33:38  a1
2016-06-09 8:33:39  a1
2016-06-09 8:33:37  b1
2016-06-09 8:33:38  b1

The above methods allow you to choose whether to keep the first, last or none of the duplicates by setting the keep parameter to 'first', 'last' or False. So to remove all duplicates from df, use keep=False.
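A minimal sketch (with made-up data, not the asker's frame) of how keep changes the result of drop_duplicates:

```python
import pandas as pd

# Two identical rows (0 and 1) plus one unique row (2).
df = pd.DataFrame({'timestamp': ['8:33:37', '8:33:37', '8:33:38'],
                   'id': ['a1', 'a1', 'a1']})

print(df.drop_duplicates(keep='first'))  # keeps rows 0 and 2
print(df.drop_duplicates(keep='last'))   # keeps rows 1 and 2
print(df.drop_duplicates(keep=False))    # keeps only row 2
```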

Upvotes: 0

jezrael

Reputation: 862541

You can use groupby and get a mask of all duplicated values per group with Series.duplicated. Then use boolean indexing:

print(df.groupby(['id'])['timestamp'].apply(lambda x: x.duplicated(keep=False)))
0     True
1     True
2    False
3     True
4     True
5    False
6    False
Name: timestamp, dtype: bool

print(df[df.groupby(['id'])['timestamp'].apply(lambda x: x.duplicated(keep=False))])
            timestamp  id
0 2016-06-09 08:33:37  a1
1 2016-06-09 08:33:37  a1
3 2016-06-09 08:33:39  a1
4 2016-06-09 08:33:39  a1
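An equivalent sketch (using the question's toy data) with DataFrame.duplicated and a subset, which marks rows whose (id, timestamp) pair appears more than once and avoids the groupby entirely:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': ['8:33:37', '8:33:37', '8:33:38',
                                 '8:33:39', '8:33:39', '8:33:37', '8:33:38'],
                   'id': ['a1', 'a1', 'a1', 'a1', 'a1', 'b1', 'b1']})

# keep=False marks every member of each duplicated (id, timestamp) group.
mask = df.duplicated(subset=['id', 'timestamp'], keep=False)
print(df[mask])  # rows 0, 1, 3, 4
```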

Upvotes: 1
