Reputation: 2287
I am working with a DataFrame with a DatetimeIndex
and two additional columns, A
and B
, and trying to provide an output DataFrame to answer a question like:
Determine the average value
B
for eachA
in months 6-12 after earliest occurrence ofA
I have been working with pd.Grouper
and understand how to group DateTime index in buckets (e.g. df.groupby(pd.Grouper(freq='M')).mean()
), but am unclear as to how to compute the average over some period since the earliest observation for each value A
in the dataset.
The input DataFrame looks something like:
data = {
'A': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
'B': [10, 32, 12, 13, 24, 32, 12, 72, 90],
'created_on': [
'2018-01-31',
'2019-02-25',
'2018-02-12',
'2019-05-31',
'2021-03-12',
'2020-04-23',
'2016-01-11',
'2016-05-02',
'2018-12-31',
]
}
df = pd.DataFrame(data)
df = df.set_index(pd.to_datetime(df['created_on']))
df.drop(['created_on'], axis=1, inplace=True)
This generates a DataFrame like the following:
+------------+---+----+
| created_on | A | B |
+------------+---+----+
| 2018-01-31 | x | 10 |
| 2019-02-25 | x | 32 |
| 2019-05-31 | x | 13 |
| 2021-03-12 | y | 24 |
| 2016-05-02 | y | 72 |
| ... | . | .. |
+------------+---+----+
The target is a desired output shaped like:
+---+----------------------------------------------+
| A | avg_B_6_12_months_after_earliest_observation |
+---+----------------------------------------------+
| x | 12.2 |
| y | 18.1 |
+---+----------------------------------------------+
The values in the avg_B_6_12_months_after_earliest_observation
column above are for example only, they do not correlate to the values provided in the example input DataFrame.
Upvotes: 1
Views: 658
Reputation: 29742
IIUC, you can define a custom function and apply it to pandas.DataFrame.groupby
:
def filtersum(data):
data = data.iloc[1:]
ind = data.index[data.index.month.to_series().between(6, 12)]
return data.loc[ind, "B"].mean()
new_df = df.sort_index().groupby("A", as_index=False).apply(filtersum)
print(new_df)
Output:
A NaN
0 x NaN
1 y 90.0
Logic:
data.iloc[1:]
: Excludes the first observation from the calculationdata.index[data.index.to_series().between(6, 12)
]: filters the indices if their months are between 6 and 12 (inclusive).df.sort_index().groupby
: data must be sorted by their indices so that excluded first observation is indeed the chronological first.Note (based on the sample data):
Customer x
didn't spend anything between June and December:
A B
created_on
2018-01-31 x 10
2018-02-12 x 12
2019-02-25 x 32
2019-05-31 x 13
Customer y
only spent once in 2018-12-31
:
A B
created_on
2016-01-11 y 12
2016-05-02 y 72
2018-12-31 y 90 <<<
2020-04-23 y 32
2021-03-12 y 24
Upvotes: 2
Reputation: 29635
One idea is to use groupby.transform
with idxmin
to align per row the date of the first occurrence of each element in A. Then you can compare the index with the value of the first occurrence adding 6 or 12 months. Use this in a loc to select the wanted rows, groupby
and mean
.
# working dummy example
df = pd.DataFrame(
{'A': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
'B': [10, 32, 12, 13, 24, 32, 12, 72, 90],},
index = pd.to_datetime(
['2018-01-31', '2018-02-25', '2018-08-12',
'2018-10-31', '2021-03-12', '2016-10-23',
'2016-01-11', '2016-05-02', '2016-12-31'])
)
print(df)
# helper series with aligned index and idxmin per A
s = df.groupby(['A'])['B'].transform('idxmin')
print(s)
# 2018-01-31 2018-01-31
# 2018-02-25 2018-01-31 # here first date if x align with this row
# 2018-08-12 2018-01-31
# 2018-10-31 2018-01-31
# 2021-03-12 2016-01-11
# 2016-10-23 2016-01-11
# 2016-01-11 2016-01-11
# 2016-05-02 2016-01-11
# 2016-12-31 2016-01-11
# Name: B, dtype: datetime64[ns]
Now you can get the result
res = (
# select rows with date in the 6-12 months after 1rst occurrence
df.loc[(s.index>=s+pd.DateOffset(months=6))
& (s.index<=s+pd.DateOffset(months=12))]
# groupby and mean
.groupby('A')['B'].mean()
# cosmetic to fit expected output
.rename('avg_B_6_12_months_after_earliest_observation')
.reset_index()
)
print(res)
# A avg_B_6_12_months_after_earliest_observation
# 0 x 12.5
# 1 y 61.0
Upvotes: 3