Reputation: 5612
I start from an input DataFrame that looks like this:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"created_on": [datetime(2015, 1, 3),
                                  datetime(2015, 1, 5),
                                  datetime(2015, 2, 24),
                                  datetime(2015, 3, 6),
                                  datetime(2015, 3, 17),
                                  datetime(2015, 5, 31),
                                  datetime(2015, 6, 3)],
                   "value": [3, 2, 1, 1, 3, 2, 2]})
i.e.:
created_on value
0 2015-01-03 3
1 2015-01-05 2
2 2015-02-24 1
3 2015-03-06 1
4 2015-03-17 3
5 2015-05-31 2
6 2015-06-03 2
I would like to get, for each month, the median of the observations whose created_on
is in the current month or within the past 2 months from the current month.
The expected output for the input data above is:
month median_value
2015-01-01 2.5
2015-02-01 2
2015-03-01 2
2015-04-01 1
2015-05-01 2
2015-06-01 2
That is, for month "2015-01-01" only the observations "2015-01-03" and "2015-01-05" are used; for month "2015-02-01" I take "2015-01-03", "2015-01-05" and "2015-02-24"; and so on.
I wanted to extract the month from the date column and then use groupby like this:
df['created_on_month'] = df['created_on'].apply(
    lambda dt: datetime(dt.year, dt.month, 1)
)
df.groupby('created_on_month').median()
But I don't know how to aggregate over 3 months in groupby, in such a manner that one row can belong to several groups. Another problem with this solution is that empty months ("2015-04-01" in the example above) do not appear in the result.
I have also tried rolling_median, provided by pandas, which could do the job, but it has to be combined with a resample. That works fine when just counting the number of observations, but the median of monthly medians is not the same as the median of the pooled observations.
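To illustrate that last point with the sample data, here is a minimal sketch (the variable names are only illustrative) comparing the two quantities for the window ending in February:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"created_on": [datetime(2015, 1, 3), datetime(2015, 1, 5),
                                  datetime(2015, 2, 24), datetime(2015, 3, 6),
                                  datetime(2015, 3, 17), datetime(2015, 5, 31),
                                  datetime(2015, 6, 3)],
                   "value": [3, 2, 1, 1, 3, 2, 2]})

# Observations that fall in the window for February (January and February here)
feb_window = df[df["created_on"] < datetime(2015, 3, 1)]

# Median of the pooled observations: median of [3, 2, 1]
pooled = feb_window["value"].median()

# Median of the per-month medians: median of [2.5, 1.0]
monthly = feb_window.groupby(feb_window["created_on"].dt.to_period("M"))["value"].median()
median_of_medians = monthly.median()

print(pooled, median_of_medians)
```

The pooled median is 2, which is what I want, while the median of the monthly medians is 1.75.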
Finally, I could also use a simple loop like:
months = pd.date_range('2015-01-01', '2015-06-01', freq='MS')
output = pd.DataFrame(index=months, columns=("month", "median_value"))
for m in months:
    # keep rows from two months before m up to (but excluding) the start of the next month
    tmp = df[(df.created_on >= m - pd.DateOffset(months=2))
             & (df.created_on < m + pd.DateOffset(months=1))]
    res = {"month": m, "median_value": tmp["value"].median()}
    output.loc[m] = res
print(output)
that yields:
month median_value
2015-01-01 2015-01-01 00:00:00 2.5
2015-02-01 2015-02-01 00:00:00 2
2015-03-01 2015-03-01 00:00:00 2
2015-04-01 2015-04-01 00:00:00 1
2015-05-01 2015-05-01 00:00:00 2
2015-06-01 2015-06-01 00:00:00 2
but if a more elegant solution exists, I would be happy to learn it.
Upvotes: 3
Views: 3119
Reputation: 30424
OK, this should be pretty close. I'm using a 90-day window because I'm not sure whether a 3-month window is possible, so it only approximates three calendar months. Otherwise, it seems to work pretty well.
df2 = pd.rolling_apply(df.set_index('created_on')['value'], window=90,
                       func=np.nanmedian, freq='d', min_periods=1)
# [1:] is a kludge to get end of month rather than beginning;
# there is probably a better way to do that...
df2[(df2.index.day == 1)[1:]]
created_on
2015-01-31 2.5
2015-02-28 2.0
2015-03-31 2.0
2015-04-30 1.0
2015-05-31 2.0
Note that because my method is different from yours, the results are labelled by end of month rather than beginning of month, but that shouldn't affect the results and I think end of month is actually the more accurate label anyway.
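Since `pd.rolling_apply` was removed from later pandas releases, here is a rough sketch of the same 90-day idea using the `.resample()` and `.rolling()` methods. It assumes a reasonably recent pandas; the names `daily`, `rolled` and `month_end` are just illustrative, and `is_month_end` stands in for the `[1:]` kludge.

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"created_on": [datetime(2015, 1, 3), datetime(2015, 1, 5),
                                  datetime(2015, 2, 24), datetime(2015, 3, 6),
                                  datetime(2015, 3, 17), datetime(2015, 5, 31),
                                  datetime(2015, 6, 3)],
                   "value": [3, 2, 1, 1, 3, 2, 2]})

# One row per calendar day; days without observations become NaN
daily = df.set_index("created_on")["value"].resample("D").mean()

# Median over the trailing 90 days, ignoring the NaN days
rolled = daily.rolling(window=90, min_periods=1).median()

# Keep only the last day of each month
month_end = rolled[rolled.index.is_month_end]
print(month_end)
```

As with the original version, the daily series stops at the last observation (June 3), so June 30 is not in the result.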
I'm not sure of the best way to get a result for June-2015 to print out, but it is stored correctly in df2 for June 3:
df2.tail(1)
created_on
2015-06-03 2
So it's just a matter of how best to extract and display the info. I suppose just padding out the original dataframe with a missing value for June 30 would be one way.
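If exact calendar months matter more than a 90-day approximation, another possible approach (a sketch only, not the rolling method above) is to let each observation count towards its own month and the two following months, then group by month. The names `expanded`, `all_months` and `result` are made up for this illustration.

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"created_on": [datetime(2015, 1, 3), datetime(2015, 1, 5),
                                  datetime(2015, 2, 24), datetime(2015, 3, 6),
                                  datetime(2015, 3, 17), datetime(2015, 5, 31),
                                  datetime(2015, 6, 3)],
                   "value": [3, 2, 1, 1, 3, 2, 2]})

# Calendar month of each observation
month = pd.PeriodIndex(df["created_on"], freq="M")

# Duplicate every row for its own month and the next two months,
# i.e. every month whose 3-month window contains the observation
expanded = pd.concat([df.assign(month=month + k) for k in range(3)],
                     ignore_index=True)

# Median per month, restricted to the observed range; empty months are kept
all_months = pd.period_range(month.min(), month.max(), freq="M")
result = expanded.groupby("month")["value"].median().reindex(all_months)

# Label each month by its first day, as in the question
result.index = result.index.to_timestamp()
print(result)
```

On the sample data this gives 2.5, 2, 2, 1, 2, 2 for January through June, including April and June. A month with no observations anywhere in its window would come out as NaN.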
Upvotes: 1