Richard
Richard

Reputation: 65560

pandas: given a start and end date, add a column for each day in between, then add values?

This is my data:

df = pd.DataFrame([
   {start_date: '2019/12/01', end_date: '2019/12/05', spend: 10000, campaign_id: 1}
   {start_date: '2019/12/05', end_date: '2019/12/09', spend: 50000, campaign_id: 2}
   {start_date: '2019/12/01', end_date: '', spend: 10000, campaign_id: 3}
   {start_date: '2019/12/01', end_date: '2019/12/01', spend: 50, campaign_id: 4}
]);

I need to add a column to each row for each day since 2019/12/01, and calculate the spend on that campaign that day, which I'll get by dividing the spend on the campaign by the total number of days it was active.

So here I'd add a column for each day between 1 December and today (10 December). For row 1, the content of the five columns for 1 Dec to 5 Dec would be 2000, then for the six ocolumns from 5 Dec to 10 Dec it would be zero.

I know pandas is well-designed for this kind of problem, but I have no idea where to start!

Upvotes: 0

Views: 373

Answers (1)

Henry Yik
Henry Yik

Reputation: 22503

Doesn't seem like a straight forward task to me. But first convert your date columns if you haven't already:

df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])

Then create a helper function for resampling:

def resampler(data, daterange):
    temp = (data.set_index('start_date').groupby('campaign_id')
                 .apply(daterange)
                 .drop("campaign_id",axis=1)
                 .reset_index().rename(columns={"level_1":"start_date"}))
    return temp

Now its a 3 step process. First resample your data according to end_date of each group:

df1 = resampler(df, lambda d: d.reindex(pd.date_range(min(d.index),max(d["end_date"]),freq="D")) if d["end_date"].notnull().all() else d)

df1["spend"] = df1.groupby("campaign_id")["spend"].transform(lambda x: x.mean()/len(x))

With the average values calculated, resample again to current date:

dates = pd.date_range(min(df["start_date"]),pd.Timestamp.today(),freq="D")

df1 = resampler(df1,lambda d: d.reindex(dates))

Finally transpose your dataframe:

df1 = pd.concat([df1.drop("end_date",axis=1).set_index(["campaign_id","start_date"]).unstack(),
                 df1.groupby("campaign_id")["end_date"].min()], axis=1)
df1.columns = [*dates,"end_date"]

print (df1)

#
             2019-12-01 00:00:00  2019-12-02 00:00:00  2019-12-03 00:00:00  2019-12-04 00:00:00  2019-12-05 00:00:00  2019-12-06 00:00:00  2019-12-07 00:00:00  2019-12-08 00:00:00  2019-12-09 00:00:00  2019-12-10 00:00:00   end_date
campaign_id                                                                                                                                                                                                                             
1                         2000.0               2000.0               2000.0               2000.0               2000.0                  NaN                  NaN                  NaN                  NaN                  NaN 2019-12-05
2                            NaN                  NaN                  NaN                  NaN              10000.0              10000.0              10000.0              10000.0              10000.0                  NaN 2019-12-09
3                        10000.0                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN        NaT
4                           50.0                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN 2019-12-01

Upvotes: 1

Related Questions