Reputation: 996
So i have this dataframe:
+----------------+--------------+------------+
| TransactionID | TimeKey | ProductKey |
+----------------+--------------+------------+
| 3621015 | '2018-01-01' | 6151 |
| 3621015 | '2018-01-01' | 4354 |
| 3487373 | '2018-01-02' | 6151 |
| 4378337 | '2018-01-03' | 6151 |
| 4378337 | '2018-01-03' | 4354 |
| 4378337 | '2018-01-03' | 3783 |
+----------------+--------------+------------+
It is a database about transactions and their respective date, and the product that was bought.
I want to count, for every row, the number of times that a certain product has appeared before. Something like this:
+----------------+--------------+------------+-------+
| TransactionID | TimeKey | ProductKey | count |
+----------------+--------------+------------+-------+
| 3621015 | '2018-01-01' | 6151 | NaN/0 |
| 3621015 | '2018-01-01' | 4354 | NaN/0 |
| 3487373 | '2018-01-02' | 6151 | 1 |
| 4378337 | '2018-01-03' | 6151 | 2 |
| 4378337 | '2018-01-03' | 4354 | 1 |
| 4378337 | '2018-01-03' | 3783 | NaN/0 |
+----------------+--------------+------------+-------+
Right now i could do it, but the way i do it is awful. My solution:
for index, row in data.iterrows():
data['count'] = data[(data['ProductKey'] == row['ProductKey']) & (data['TimeKey'] <= row['TimeKey'])]['ProductKey'].count()
Doing this for 30 lines is okay, but i'm working with a couple houndred thousand lines :\ Looking for ideas
Upvotes: 1
Views: 38
Reputation: 25239
Assume your TimeKey
is already in increasing order as in the sample, you need groupby
and cumcount
as follows
data['count'] = data.groupby('ProductKey').cumcount()
Out[384]:
TransactionID TimeKey ProductKey count
0 3621015 '2018-01-01' 6151 0
1 3621015 '2018-01-01' 4354 0
2 3487373 '2018-01-02' 6151 1
3 4378337 '2018-01-03' 6151 2
4 4378337 '2018-01-03' 4354 1
5 4378337 '2018-01-03' 3783 0
Note: If TimeKey
is not in increasing order, you need to sort dataframe by it before running command above. If you sort it, you need it in proper datetime format for the correct sorting
Upvotes: 2
Reputation: 317
data['Count'] = data.groupby(['ProductKey'])['ProductKey'].count()
.
This provides the count of each ProductKey. If you do not want counts to be NaN, then run the following snippet.
data.fillna(value = 0, inplace = True)
Upvotes: 0