Reputation: 65560
This is my data:
df = pd.DataFrame([
{start_date: '2019/12/01', end_date: '2019/12/05', spend: 10000, campaign_id: 1}
{start_date: '2019/12/05', end_date: '2019/12/09', spend: 50000, campaign_id: 2}
{start_date: '2019/12/01', end_date: '', spend: 10000, campaign_id: 3}
{start_date: '2019/12/01', end_date: '2019/12/01', spend: 50, campaign_id: 4}
]);
I need to add a column to each row for each day since 2019/12/01, and calculate the spend on that campaign that day, which I'll get by dividing the spend on the campaign by the total number of days it was active.
So here I'd add a column for each day between 1 December and today (10 December). For row 1, the content of the five columns for 1 Dec to 5 Dec would be 2000, then for the six ocolumns from 5 Dec to 10 Dec it would be zero.
I know pandas is well-designed for this kind of problem, but I have no idea where to start!
Upvotes: 0
Views: 373
Reputation: 22503
Doesn't seem like a straight forward task to me. But first convert your date columns if you haven't already:
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
Then create a helper function for resampling:
def resampler(data, daterange):
temp = (data.set_index('start_date').groupby('campaign_id')
.apply(daterange)
.drop("campaign_id",axis=1)
.reset_index().rename(columns={"level_1":"start_date"}))
return temp
Now its a 3 step process. First resample your data according to end_date
of each group:
df1 = resampler(df, lambda d: d.reindex(pd.date_range(min(d.index),max(d["end_date"]),freq="D")) if d["end_date"].notnull().all() else d)
df1["spend"] = df1.groupby("campaign_id")["spend"].transform(lambda x: x.mean()/len(x))
With the average values calculated, resample
again to current date:
dates = pd.date_range(min(df["start_date"]),pd.Timestamp.today(),freq="D")
df1 = resampler(df1,lambda d: d.reindex(dates))
Finally transpose your dataframe:
df1 = pd.concat([df1.drop("end_date",axis=1).set_index(["campaign_id","start_date"]).unstack(),
df1.groupby("campaign_id")["end_date"].min()], axis=1)
df1.columns = [*dates,"end_date"]
print (df1)
#
2019-12-01 00:00:00 2019-12-02 00:00:00 2019-12-03 00:00:00 2019-12-04 00:00:00 2019-12-05 00:00:00 2019-12-06 00:00:00 2019-12-07 00:00:00 2019-12-08 00:00:00 2019-12-09 00:00:00 2019-12-10 00:00:00 end_date
campaign_id
1 2000.0 2000.0 2000.0 2000.0 2000.0 NaN NaN NaN NaN NaN 2019-12-05
2 NaN NaN NaN NaN 10000.0 10000.0 10000.0 10000.0 10000.0 NaN 2019-12-09
3 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
4 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2019-12-01
Upvotes: 1