Reputation: 155
I have timestamp
and id
variables in my dataframe (df
)
timestamp id
2016-06-09 8:33:37 a1
2016-06-09 8:33:37 a1
2016-06-09 8:33:38 a1
2016-06-09 8:33:39 a1
2016-06-09 8:33:39 a1
2016-06-09 8:33:37 b1
2016-06-09 8:33:38 b1
Each id
can't have two timestamps. I have to print these duplicate timestamps for each id
. In my above case, the output should be for rows 1,2,4,5
The following code will give the duplicate timestamp
set([x for x in df['timestamp'] if df['timestamp'].count(x) > 1])
How to consider id
along with timestamp
to have the duplicate rows?
Upvotes: 2
Views: 6409
Reputation: 4744
If you wish to find all duplicates then use the duplicated
method. It only works on the columns. On the other hand df.index.duplicated
works on the index. Therefore we do a quick reset_index
to bring the index into the columns.
df = df.reset_index()
df.loc[df.duplicated(keep=False)]
index id
0 2016-06-09 8:33:37 a1
1 2016-06-09 8:33:37 a1
3 2016-06-09 8:33:39 a1
4 2016-06-09 8:33:39 a1
If you just wish to remove duplicates there is the DataFrame method drop_duplicates
.
df = df.reset_index()
df = df.drop_duplicates() # keep='first' by default.
index id
0 2016-06-09 8:33:37 a1
2 2016-06-09 8:33:38 a1
3 2016-06-09 8:33:39 a1
5 2016-06-09 8:33:37 b1
6 2016-06-09 8:33:38 b1
If you wish to get back your old index then set_index using the default columns name 'index' for either of the above then simply use.
df.set_index('index')
id
index
2016-06-09 8:33:37 a1
2016-06-09 8:33:38 a1
2016-06-09 8:33:39 a1
2016-06-09 8:33:37 b1
2016-06-09 8:33:38 b1
The above methods allow you choose whether to keep the first, last or none of the duplicates by setting the keep
attribute to 'first'
, 'last'
or False
. So to remove all duplicates in df
then use keep=False
.
Upvotes: 0
Reputation: 862541
You can use groupby
and get mask of all duplicates values per group by Series.duplicated
. Last use boolean indexing
:
print (df.groupby(['id'])['timestamp'].apply(lambda x: x.duplicated(keep=False)))
0 True
1 True
2 False
3 True
4 True
5 False
6 False
Name: timestamp, dtype: bool
print (df[df.groupby(['id'])['timestamp'].apply(lambda x: x.duplicated(keep=False))])
timestamp id
0 2016-06-09 08:33:37 a1
1 2016-06-09 08:33:37 a1
3 2016-06-09 08:33:39 a1
4 2016-06-09 08:33:39 a1
Upvotes: 1