Reputation: 6499
I have a dataframe similar to this one, containing an enumerated date, an ID, and the number of items sold for a given date and ID:
Date  ID  num_sold
0     1   0
0     2   13
1     1   6
1     2   0
2     1   0
2     2   0
3     1   5
3     2   4
For each ID, I want to compute the Date when the last sale happened, so that I get the following dataframe:
Date  ID  num_sold  last_sale
0     1   0         -1
0     2   13        -1
1     1   6         -1
1     2   0         0
2     1   0         1
2     2   0         0
3     1   5         1
3     2   4         0
I want to compute the column last_sale, whose values (1 and 0 here) are the Date on which, for the given ID, num_sold was last non-zero. If there is no such earlier date (e.g. the first month in the dataset), it is -1 (or could be NaN).
In the given example: at date 1, ID 1 was never sold before, so we set last_sale to -1, while ID 2 was last sold in month 0, so we set last_sale to 0. Similarly, at date 3, ID 1 was last sold in month 1, so last_sale is 1, and ID 2 was last sold in month 0, so last_sale is 0.
What's the most efficient and concise way to accomplish this?
Upvotes: 2
Views: 274
Reputation: 1661
First you create a column for the current sale's date:
df["current_sale_date"] = 0
df.loc[df.num_sold != 0, "current_sale_date"] = df.Date
You create another column for the last sale's date. First you replace 0 with np.nan so that .ffill() can work; then, for each ID, you propagate the date of the most recent sale forward with .ffill().
df.current_sale_date = df.current_sale_date.replace(0, np.nan)
df["last_sale_date"] = df.groupby(['ID'])['current_sale_date'].ffill()
Then you get the previous sale date with a shift per ID:
df.last_sale_date = df.groupby(['ID'])['last_sale_date'].shift()
For the first date, you set -1:
df.loc[df.Date == df.Date.min(), "last_sale_date"] = -1
You replace NaN back with 0, because 0 is a valid date:
df.last_sale_date = df.last_sale_date.replace(np.nan, 0)
For an ID's first sale, when there is no previous sale date, you set -1:
df.loc[(df.current_sale_date == 1) & (df.last_sale_date == 0), "last_sale_date"] = -1
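The steps above can be collected into one runnable snippet. Using Series.where instead of the 0/NaN replace round-trip is a small simplification (a sketch, not necessarily the fastest form): a sale on date 0 then never collides with the 0 sentinel, so the replace-back and first-sale fix-up steps are no longer needed.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"Date": [0, 0, 1, 1, 2, 2, 3, 3],
                   "ID":   [1, 2, 1, 2, 1, 2, 1, 2],
                   "num_sold": [0, 13, 6, 0, 0, 0, 5, 4]})

# Date of the sale on each row, NaN where nothing was sold,
# so ffill() can carry the most recent sale date forward.
df["current_sale_date"] = df["Date"].where(df["num_sold"] != 0)

# Per ID: carry the last sale date forward, then shift so each row
# sees the previous sale date rather than its own.
df["last_sale"] = df.groupby("ID")["current_sale_date"].ffill()
df["last_sale"] = df.groupby("ID")["last_sale"].shift()

# Rows with no earlier sale get -1.
df["last_sale"] = df["last_sale"].fillna(-1).astype(int)
df = df.drop(columns="current_sale_date")
```

Because `where` keeps the real date only on sale rows and leaves NaN elsewhere, 0 remains usable as an ordinary date value throughout.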
Upvotes: 3
Reputation: 470
First, let's create your DataFrame df1.
import pandas as pd
df1 = pd.DataFrame({'Date': ['0', '0', '1', '1', '2', '2', '3', '3'],
                    'ID': ['1', '2', '1', '2', '1', '2', '1', '2'],
                    'num_sold': [0, 13, 6, 0, 0, 0, 5, 4]})
print(df1)
Output:
Date ID num_sold
0 0 1 0
1 0 2 13
2 1 1 6
3 1 2 0
4 2 1 0
5 2 2 0
6 3 1 5
7 3 2 4
Now we'll group it by ID, apply your logic to each ID group, and concatenate the groups.
groups = []
for name, group in df1.groupby(['ID']):
    # Previous row's date where the previous row was a sale (NaN on the first row).
    group.loc[group['num_sold'].shift(1) != 0, 'last_sale'] = group['Date'].shift(1)
    group['last_sale'] = group['last_sale'].ffill().fillna('-1')
    groups.append(group)
df2 = pd.concat(groups).sort_values(['Date', 'ID'])
print(df2)
Your result:
Date ID num_sold last_sale
0 0 1 0 -1
1 0 2 13 -1
2 1 1 6 -1
3 1 2 0 0
4 2 1 0 1
5 2 2 0 0
6 3 1 5 1
7 3 2 4 0
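The explicit loop can also be folded into pandas' grouped shift/ffill operations, which apply per ID without a Python-level loop. A sketch of that variant (rebuilding df1 so the snippet is self-contained):

```python
import pandas as pd

df1 = pd.DataFrame({'Date': ['0', '0', '1', '1', '2', '2', '3', '3'],
                    'ID': ['1', '2', '1', '2', '1', '2', '1', '2'],
                    'num_sold': [0, 13, 6, 0, 0, 0, 5, 4]})

g = df1.groupby('ID')
# Previous row's date within each ID, kept only where that previous row was a
# sale. NaN != 0 is True, so each group's first row passes the mask but stays NaN.
prev_sale_date = g['Date'].shift().where(g['num_sold'].shift() != 0)
df1['last_sale'] = prev_sale_date.groupby(df1['ID']).ffill().fillna('-1')
```

This computes the same last_sale column as the loop, since shift and ffill inside a groupby respect the group boundaries.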
Good luck!
Upvotes: 1