DevEx

Reputation: 4571

Count occurrences of a row while looping through a DataFrame

I have a DataFrame in which every row is repeated 3 times. While looping through it, how can I tell that I've already seen a row and then do something, e.g. print a message at its second occurrence in the loop?

print df
       user     date
0      User001  2014-11-01
40     User001  2014-11-01
80     User001  2014-11-01
120    User001  2014-11-08
200    User001  2014-11-08
160    User001  2014-11-08
280    User001  2014-11-15
240    User001  2014-11-15
320    User001  2014-11-15
400    User001  2014-11-22
440    User001  2014-11-22
360    User001  2014-11-22
...    ......   ..........
...    ......   ..........
1300   User008  2014-11-22
1341   User008  2014-11-22
1360   User008  2014-11-22

for line in df.itertuples():
    user = line[1]
    date = line[2]

    print user, date
    # do something at the second occurrence of the tuple, e.g. print "second occurrence"

('User001', '2014-11-01')
('User001',  '2014-11-01')
second occurrence
('User001',  '2014-11-01')
('User001',  '2014-11-08')
('User001',  '2014-11-08')
second occurrence
('User001',  '2014-11-08')
('User001',  '2014-11-15')
('User001',  '2014-11-15')
second occurrence
('User001',  '2014-11-15')
('User001',  '2014-11-22')
('User001',  '2014-11-22')
second occurrence
('User001',  '2014-11-22')
('User008',  '2014-11-22')
('User008',  '2014-11-22')
second occurrence
('User008',  '2014-11-22')

Upvotes: 2

Views: 741

Answers (3)

piRSquared

Reputation: 294516

Use a Counter to track how many times each row has been seen so far:

from collections import Counter

seen = Counter()
for i, row in df.iterrows():
    tup = tuple(row.values.tolist())
    if seen[tup] == 1:
        print(tup, '  second occurrence')
    else:
        print(tup)
    seen.update([tup])

('User001', '2014-11-01')
('User001', '2014-11-01')   second occurrence
('User001', '2014-11-01')
('User001', '2014-11-08')
('User001', '2014-11-08')   second occurrence
('User001', '2014-11-08')
('User001', '2014-11-15')
('User001', '2014-11-15')   second occurrence
('User001', '2014-11-15')
('User001', '2014-11-22')
('User001', '2014-11-22')   second occurrence
('User001', '2014-11-22')
('User008', '2014-11-22')
('User008', '2014-11-22')   second occurrence
('User008', '2014-11-22')
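
The same counting idea can be wrapped in a small generator so the loop body only has to check the occurrence number. This is a minimal sketch, not part of the original answer: `with_occurrence` is a hypothetical helper name, and the `rows` list is made-up sample data standing in for something like `df.itertuples(index=False, name=None)`.

```python
from collections import Counter

def with_occurrence(rows):
    """Yield (row, n), where n counts how often row has been seen, including now."""
    seen = Counter()
    for row in rows:
        seen[row] += 1
        yield row, seen[row]

# Hypothetical sample; with pandas you could pass
# df.itertuples(index=False, name=None) instead of this list.
rows = [
    ('User001', '2014-11-01'),
    ('User001', '2014-11-01'),
    ('User001', '2014-11-01'),
    ('User001', '2014-11-08'),
    ('User001', '2014-11-08'),
    ('User001', '2014-11-08'),
]

second_positions = []
for pos, (row, n) in enumerate(with_occurrence(rows)):
    if n == 2:
        # Record the position and report the second sighting of this row.
        second_positions.append(pos)
        print(row, 'second occurrence')
    else:
        print(row)
```

Changing the `n == 2` test is enough to react to a different occurrence, such as the third.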

Upvotes: 1

jezrael

Reputation: 863531

You can use cumcount to find the indices of every second occurrence:

mask = df.groupby(['user', 'date']).cumcount() == 1
idx = mask[mask].index
print (idx)
Int64Index([40, 200, 240, 440], dtype='int64')
for line in df.itertuples():
    print (line.user)
    print (line.date)
    if line.Index in idx:
        print ('second occurrence')

User001
2014-11-01
User001
2014-11-01
second occurrence
User001
2014-11-01
User001
2014-11-08
User001
2014-11-08
second occurrence
User001
2014-11-08
User001
2014-11-15
User001
2014-11-15
second occurrence
User001
2014-11-15
User001
2014-11-22
User001
2014-11-22
second occurrence
User001
2014-11-22

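Another way to find the indices selects rows that are neither the first nor the last occurrence, which is the second occurrence when every row appears exactly three times: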

idx = df[df.duplicated(['user', 'date']) & 
         df.duplicated(['user', 'date'], keep='last')].index
print (idx)
Int64Index([40, 200, 240, 440], dtype='int64')
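
To see where the two approaches differ, here is a small sketch on a made-up frame (the data and the names `keys`, `second`, and `middle` are assumptions for illustration, not from the question) in which one row appears three times and another four times. The `cumcount` mask picks exactly the second occurrence, while the `duplicated` combination picks every occurrence that is neither first nor last.

```python
import pandas as pd

# Hypothetical sample: one row repeated three times, another four times.
df = pd.DataFrame(
    {
        'user': ['User001'] * 7,
        'date': ['2014-11-01'] * 3 + ['2014-11-08'] * 4,
    },
    index=[0, 40, 80, 120, 200, 160, 280],
)

keys = ['user', 'date']

# cumcount numbers the rows of each group 0, 1, 2, ... in row order,
# so == 1 selects exactly the second occurrence.
second = df.groupby(keys).cumcount() == 1

# Not the first and not the last occurrence: every "middle" row.
middle = df.duplicated(keys) & df.duplicated(keys, keep='last')

second_idx = second[second].index.tolist()
middle_idx = middle[middle].index.tolist()
print(second_idx)
print(middle_idx)
```

With exactly three repeats per row the two lists coincide; with four, the `middle` mask also includes the third occurrence.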

Upvotes: 2

David Z

Reputation: 131740

I'd advise using the DataFrame.duplicated() method to get a boolean index identifying the duplicate rows.

How you use it depends on how you want to display the duplication, but if you want to iterate over the rows and print a notice for each one that is a duplicate, something like this might work:

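# duplicated() flags every repeat of a row after its first appearance
duplicate_index = df.duplicated()
# iterate over the rows; zip(df, ...) would iterate over column names instead
for row, dupl in zip(df.itertuples(index=False), duplicate_index):
    print(row[0], row[1])
    if dupl:
        # note: this also fires on the third occurrence of a row
        print('second occurrence')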
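
As a quick illustration of what the boolean index contains, the sketch below runs `duplicated()` with each value of its `keep` parameter on a made-up four-row frame (the data and variable names are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sample: one row three times, plus one unique row.
df = pd.DataFrame(
    {
        'user': ['User001'] * 3 + ['User008'],
        'date': ['2014-11-01'] * 3 + ['2014-11-22'],
    }
)

# Default keep='first': the first appearance is not flagged, later repeats are.
first_kept = df.duplicated().tolist()

# keep='last': the last appearance is not flagged, earlier repeats are.
last_kept = df.duplicated(keep='last').tolist()

# keep=False: every row that has a duplicate anywhere is flagged.
all_marked = df.duplicated(keep=False).tolist()

print(first_kept)
print(last_kept)
print(all_marked)
```

Because the default flags both the second and third appearances, isolating only the second one needs an extra condition, such as a per-group counter.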

Upvotes: 1
