Reputation: 4571
I have a DataFrame in which every row is repeated 3 times. While looping through it, how can I tell whether I've seen a row before and then do something, e.g. print a message at the second occurrence?
print df
user date
0 User001 2014-11-01
40 User001 2014-11-01
80 User001 2014-11-01
120 User001 2014-11-08
200 User001 2014-11-08
160 User001 2014-11-08
280 User001 2014-11-15
240 User001 2014-11-15
320 User001 2014-11-15
400 User001 2014-11-22
440 User001 2014-11-22
360 User001 2014-11-22
... ...... ..........
... ...... ..........
1300 User008 2014-11-22
1341 User008 2014-11-22
1360 User008 2014-11-22
for line in df.itertuples():
    user = line[1]
    date = line[2]
    print user, date
    # do something after the second occurrence of the tuple, i.e. print "second occurrence"
('User001', '2014-11-01')
('User001', '2014-11-01')
second occurrence
('User001', '2014-11-01')
('User001', '2014-11-08')
('User001', '2014-11-08')
second occurrence
('User001', '2014-11-08')
('User001', '2014-11-15')
('User001', '2014-11-15')
second occurrence
('User001', '2014-11-15')
('User001', '2014-11-22')
('User001', '2014-11-22')
second occurrence
('User001', '2014-11-22')
('User008', '2014-11-22')
('User008', '2014-11-22')
second occurrence
('User008', '2014-11-22')
Upvotes: 2
Views: 741
Reputation: 294516
Use a Counter to track how many times each row has already been seen:
from collections import Counter

seen = Counter()  # maps each row tuple to the number of times it has appeared so far
for i, row in df.iterrows():
    tup = tuple(row.values.tolist())
    if seen[tup] == 1:
        # exactly one earlier copy exists, so this is the second occurrence
        print(tup, 'second occurrence')
    else:
        print(tup)
    seen[tup] += 1
('User001', '2014-11-01')
('User001', '2014-11-01') second occurrence
('User001', '2014-11-01')
('User001', '2014-11-08')
('User001', '2014-11-08') second occurrence
('User001', '2014-11-08')
('User001', '2014-11-15')
('User001', '2014-11-15') second occurrence
('User001', '2014-11-15')
('User001', '2014-11-22')
('User001', '2014-11-22') second occurrence
('User001', '2014-11-22')
('User008', '2014-11-22')
('User008', '2014-11-22') second occurrence
('User008', '2014-11-22')
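If you want to reuse the counting idea outside a print loop, here is a minimal sketch of the same approach using itertuples, which avoids building a Series per row. The sample frame and the helper name second_occurrence_labels are made up for illustration; the column names user and date are taken from the question.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data shaped like the frame in the question.
df = pd.DataFrame(
    {
        "user": ["User001"] * 6,
        "date": ["2014-11-01"] * 3 + ["2014-11-08"] * 3,
    },
    index=[0, 40, 80, 120, 200, 160],
)


def second_occurrence_labels(frame):
    """Return the index labels of rows that are the second copy of their (user, date) pair."""
    seen = Counter()
    labels = []
    for row in frame.itertuples():
        key = (row.user, row.date)
        seen[key] += 1
        if seen[key] == 2:  # the count has just reached two
            labels.append(row.Index)
    return labels


print(second_occurrence_labels(df))
```

Collecting the labels instead of printing inside the loop keeps the detection separate from whatever you want to do at the second occurrence.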
Upvotes: 1
Reputation: 863531
You can use cumcount to find the indices of every second occurrence:
mask = df.groupby(['user', 'date']).cumcount() == 1
idx = mask[mask].index
print (idx)
Int64Index([40, 200, 240, 440], dtype='int64')
for line in df.itertuples():
    print(line.user)
    print(line.date)
    if line.Index in idx:
        print('second occurrence')
User001
2014-11-01
User001
2014-11-01
second occurrence
User001
2014-11-01
User001
2014-11-08
User001
2014-11-08
second occurrence
User001
2014-11-08
User001
2014-11-15
User001
2014-11-15
second occurrence
User001
2014-11-15
User001
2014-11-22
User001
2014-11-22
second occurrence
User001
2014-11-22
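Another way to find the indices is to select the rows that are neither the first nor the last occurrence. This works here because every row appears exactly three times:
idx = df[df.duplicated(['user', 'date']) &
         df.duplicated(['user', 'date'], keep='last')].index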
print (idx)
Int64Index([40, 200, 240, 440], dtype='int64')
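If the goal of the loop is only to pick out the second copies, the cumcount result can also be stored as a column so no explicit loop is needed. This is a rough sketch on made-up sample data; the column name occurrence is my own choice, not something from the question.

```python
import pandas as pd

# Hypothetical sample data shaped like the frame in the question.
df = pd.DataFrame(
    {
        "user": ["User001"] * 6,
        "date": ["2014-11-01"] * 3 + ["2014-11-08"] * 3,
    },
    index=[0, 40, 80, 120, 200, 160],
)

# cumcount() numbers the rows of each group from 0 in order of appearance,
# so adding 1 gives 1 for the first copy, 2 for the second, and so on.
df["occurrence"] = df.groupby(["user", "date"]).cumcount() + 1

# Keep only the rows that are the second copy of their (user, date) pair.
second = df[df["occurrence"] == 2]
print(second)
```

Unlike the duplicated-based mask, this does not rely on each row appearing exactly three times.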
Upvotes: 2
Reputation: 131740
I'd advise using the DataFrame.duplicated() method to get a boolean Series identifying the duplicate rows. By default it marks every occurrence after the first as a duplicate.
Depending on how you want to display the duplication, you can use this in various ways, but if you want to iterate the rows and print a notice for each one which is a duplicate, something like this might work:
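# duplicated() is True for every repeat of an earlier row,
# so with three copies both the second and the third are flagged
duplicate_index = df.duplicated()
# iterate over the rows (iterating df itself would yield column names)
for row, dupl in zip(df.itertuples(index=False), duplicate_index):
    print(row[0], row[1])
    if dupl:
        print('seen before')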
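To make the effect of the keep argument concrete, here is a small sketch on made-up sample data. The variable names after_first, before_last and any_copy are purely illustrative.

```python
import pandas as pd

# Hypothetical sample data shaped like the frame in the question.
df = pd.DataFrame(
    {
        "user": ["User001"] * 6,
        "date": ["2014-11-01"] * 3 + ["2014-11-08"] * 3,
    },
    index=[0, 40, 80, 120, 200, 160],
)

cols = ["user", "date"]

# keep='first' (the default): every copy except the first is flagged.
after_first = df.duplicated(cols)

# keep='last': every copy except the last is flagged.
before_last = df.duplicated(cols, keep="last")

# keep=False: every row that has at least one other copy is flagged.
any_copy = df.duplicated(cols, keep=False)

print(pd.DataFrame({"after_first": after_first,
                    "before_last": before_last,
                    "any_copy": any_copy}))
```

With exactly three copies per row, only the middle copy is True in both after_first and before_last.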
Upvotes: 1