Counting values before a date, iterating for every row

Question

So i have this dataframe:

+----------------+--------------+------------+
| TransactionID  |   TimeKey    | ProductKey |
+----------------+--------------+------------+
|        3621015 | '2018-01-01' |       6151 |
|        3621015 | '2018-01-01' |       4354 |
|        3487373 | '2018-01-02' |       6151 |
|        4378337 | '2018-01-03' |       6151 |
|        4378337 | '2018-01-03' |       4354 |
|        4378337 | '2018-01-03' |       3783 |
+----------------+--------------+------------+

It is a database about transactions and their respective date, and the product that was bought.

I want to count, for every row, the number of times that a certain product has appeared before. Something like this:

+----------------+--------------+------------+-------+
| TransactionID  |   TimeKey    | ProductKey | count |
+----------------+--------------+------------+-------+
|        3621015 | '2018-01-01' |       6151 | NaN/0 |
|        3621015 | '2018-01-01' |       4354 | NaN/0 |
|        3487373 | '2018-01-02' |       6151 | 1     |
|        4378337 | '2018-01-03' |       6151 | 2     |
|        4378337 | '2018-01-03' |       4354 | 1     |
|        4378337 | '2018-01-03' |       3783 | NaN/0 |
+----------------+--------------+------------+-------+

Right now i could do it, but the way i do it is awful. My solution:

for index, row in data.iterrows():
    data['count'] = data[(data['ProductKey'] == row['ProductKey']) & (data['TimeKey'] <= row['TimeKey'])]['ProductKey'].count()

Doing this for 30 lines is okay, but i'm working with a couple houndred thousand lines :\ Looking for ideas

Andy L. · Accepted Answer

Assume your TimeKey is already in increasing order as in the sample, you need groupby and cumcount as follows

data['count'] =  data.groupby('ProductKey').cumcount()

Out[384]:
   TransactionID       TimeKey  ProductKey  count
0        3621015  '2018-01-01'        6151      0
1        3621015  '2018-01-01'        4354      0
2        3487373  '2018-01-02'        6151      1
3        4378337  '2018-01-03'        6151      2
4        4378337  '2018-01-03'        4354      1
5        4378337  '2018-01-03'        3783      0

Note: If TimeKey is not in increasing order, you need to sort dataframe by it before running command above. If you sort it, you need it in proper datetime format for the correct sorting

Counting values before a date, iterating for every row

Answers (2)

Related Questions