Ricardo Vilaça
Ricardo Vilaça

Reputation: 996

Counting values before a date, iterating for every row

So i have this dataframe:

+----------------+--------------+------------+
| TransactionID  |   TimeKey    | ProductKey |
+----------------+--------------+------------+
|        3621015 | '2018-01-01' |       6151 |
|        3621015 | '2018-01-01' |       4354 |
|        3487373 | '2018-01-02' |       6151 |
|        4378337 | '2018-01-03' |       6151 |
|        4378337 | '2018-01-03' |       4354 |
|        4378337 | '2018-01-03' |       3783 |
+----------------+--------------+------------+

It is a database about transactions and their respective date, and the product that was bought.

I want to count, for every row, the number of times that a certain product has appeared before. Something like this:

+----------------+--------------+------------+-------+
| TransactionID  |   TimeKey    | ProductKey | count |
+----------------+--------------+------------+-------+
|        3621015 | '2018-01-01' |       6151 | NaN/0 |
|        3621015 | '2018-01-01' |       4354 | NaN/0 |
|        3487373 | '2018-01-02' |       6151 | 1     |
|        4378337 | '2018-01-03' |       6151 | 2     |
|        4378337 | '2018-01-03' |       4354 | 1     |
|        4378337 | '2018-01-03' |       3783 | NaN/0 |
+----------------+--------------+------------+-------+

Right now i could do it, but the way i do it is awful. My solution:

for index, row in data.iterrows():
    data['count'] = data[(data['ProductKey'] == row['ProductKey']) & (data['TimeKey'] <= row['TimeKey'])]['ProductKey'].count()

Doing this for 30 lines is okay, but i'm working with a couple houndred thousand lines :\ Looking for ideas

Upvotes: 1

Views: 38

Answers (2)

Andy L.
Andy L.

Reputation: 25239

Assume your TimeKey is already in increasing order as in the sample, you need groupby and cumcount as follows

data['count'] =  data.groupby('ProductKey').cumcount()

Out[384]:
   TransactionID       TimeKey  ProductKey  count
0        3621015  '2018-01-01'        6151      0
1        3621015  '2018-01-01'        4354      0
2        3487373  '2018-01-02'        6151      1
3        4378337  '2018-01-03'        6151      2
4        4378337  '2018-01-03'        4354      1
5        4378337  '2018-01-03'        3783      0

Note: If TimeKey is not in increasing order, you need to sort dataframe by it before running command above. If you sort it, you need it in proper datetime format for the correct sorting

Upvotes: 2

Ananth Reddy
Ananth Reddy

Reputation: 317

data['Count'] = data.groupby(['ProductKey'])['ProductKey'].count().

This provides the count of each ProductKey. If you do not want counts to be NaN, then run the following snippet.

data.fillna(value = 0, inplace = True)

Upvotes: 0

Related Questions