Reputation: 77
I have a time series problem and I want to aggregate some data based on the values that appear in a certain column. To illustrate, consider the following table:
Date | colA | colB | colC |
---|---|---|---|
2019-01-01 | 1 | -10 | Null |
2019-01-02 | 2 | -5 | Null |
2019-01-03 | 3 | 0 | 101 |
2019-01-04 | 4 | 5 | 101 |
2019-01-05 | 5 | 10 | 101 |
2019-01-06 | 6 | 15 | Null |
2019-01-07 | 7 | 20 | 101 |
I want to accomplish the following: group each run of consecutive Null values in colC together with the first non-Null row that follows it, and for each group take that colC value, the average of colA, the average of colB, and the number of days between the first and last Date of the group. Rows whose colC is non-Null and whose previous row's colC is also non-Null should be disregarded.
For the previous table, the result would be
agg(colC) | avg(colA) | avg(colB) | delta(Date) [in days] |
---|---|---|---|
101 | 2 | -5 | 2 |
101 | 6.5 | 17.5 | 1 |
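To make the expected numbers concrete, here is a quick by-hand check of the first output row (plain Python, just to illustrate the grouping I have in mind):
# First group: 2019-01-01 .. 2019-01-03 (the two Null rows plus the first row with 101)
col_a = [1, 2, 3]
col_b = [-10, -5, 0]
print(sum(col_a) / len(col_a))   # 2.0  -> avg(colA)
print(sum(col_b) / len(col_b))   # -5.0 -> avg(colB)
# delta(Date) = 2019-01-03 - 2019-01-01 = 2 days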
I could not find any way to accomplish that so far.
Upvotes: 4
Views: 1053
Reputation: 23217
You can set groups of `colC` with `cumsum()`, and then group by those groups with `.groupby()`, as follows (assuming the Null values are `NaN` or `None`):
import numpy as np
import pandas as pd

# Convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Disregard rows where colC is not null and the previous row's colC is also not null
df2 = df.loc[df['colC'].isna() | df['colC'].shift().isna()]

# Set grouping for colC: a new group starts at each NaN that follows a non-NaN
group = (df2['colC'].shift(1).notna() & df2['colC'].isna()).cumsum()

df_out = (df2.groupby(group, as_index=False)
             .agg(**{'agg(colC)': ('colC', 'last'),
                     'avg(colA)': ('colA', 'mean'),
                     'avg(colB)': ('colB', 'mean'),
                     'delta(Date)': ('Date', lambda x: (x.iloc[-1] - x.iloc[0]).days)})
          )
Data Input:
data = {'Date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06', '2019-01-07'],
        'colA': [1, 2, 3, 4, 5, 6, 7],
        'colB': [-10, -5, 0, 5, 10, 15, 20],
        'colC': [np.nan, np.nan, 101.0, 101.0, 101.0, np.nan, 101.0]}
df = pd.DataFrame(data)
Date colA colB colC
0 2019-01-01 1 -10 NaN
1 2019-01-02 2 -5 NaN
2 2019-01-03 3 0 101.0
3 2019-01-04 4 5 101.0
4 2019-01-05 5 10 101.0
5 2019-01-06 6 15 NaN
6 2019-01-07 7 20 101.0
Result:
print(df_out)
agg(colC) avg(colA) avg(colB) delta(Date)
0 101.0 2.0 -5.0 2
1 101.0 6.5 17.5 1
Let's add more data for better illustration:
data = {'Date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08', '2019-01-09', '2019-01-10', '2019-01-11', '2019-01-12', '2019-01-13', '2019-01-14', '2019-01-15'],
        'colA': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
        'colB': [-10, -5, 0, 5, 10, 15, 15, 15, 20, 20, 15, 15, 15, 20, 20],
        'colC': [np.nan, np.nan, 101, 101, 101, np.nan, np.nan, 102, 103, 104, np.nan, np.nan, np.nan, 112, 113]}
df = pd.DataFrame(data)
Date colA colB colC
0 2019-01-01 1 -10 NaN
1 2019-01-02 2 -5 NaN
2 2019-01-03 3 0 101.0
3 2019-01-04 4 5 101.0
4 2019-01-05 5 10 101.0
5 2019-01-06 6 15 NaN
6 2019-01-07 7 15 NaN
7 2019-01-08 8 15 102.0
8 2019-01-09 9 20 103.0
9 2019-01-10 10 20 104.0
10 2019-01-11 11 15 NaN
11 2019-01-12 12 15 NaN
12 2019-01-13 13 15 NaN
13 2019-01-14 14 20 112.0
14 2019-01-15 15 20 113.0
After the date format conversion, we discard the unwanted rows by keeping only the rows whose `colC` is `NaN`, or whose `colC` is not `NaN` but whose previous row's `colC` is `NaN`.
# Convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'])
df2 = df.loc[df['colC'].isna() | df['colC'].shift().isna()]
Result:
print(df2)
Date colA colB colC
0 2019-01-01 1 -10 NaN
1 2019-01-02 2 -5 NaN
2 2019-01-03 3 0 101.0
5 2019-01-06 6 15 NaN
6 2019-01-07 7 15 NaN
7 2019-01-08 8 15 102.0
10 2019-01-11 11 15 NaN
11 2019-01-12 12 15 NaN
12 2019-01-13 13 15 NaN
13 2019-01-14 14 20 112.0
Now, to set up the groups, let's also show the intermediate values in the dataframe by adding helper columns, for illustration purposes:
df2['Group_indicator'] = df2['colC'].shift(1).notna() & df2['colC'].isna()
This sets a boolean mask so that the first entry of each group is `True` and the other entries are `False` (the very first group is an exception, since its first row stays `False`, but that causes no problem when we generate the group numbers).
Result:
Date colA colB colC Group_indicator
0 2019-01-01 1 -10 NaN False
1 2019-01-02 2 -5 NaN False
2 2019-01-03 3 0 101.0 False
5 2019-01-06 6 15 NaN True
6 2019-01-07 7 15 NaN False
7 2019-01-08 8 15 102.0 False
10 2019-01-11 11 15 NaN True
11 2019-01-12 12 15 NaN False
12 2019-01-13 13 15 NaN False
13 2019-01-14 14 20 112.0 False
Then, we generate the group numbers:
df2['Group'] = df2['Group_indicator'].cumsum()
Result:
Date colA colB colC Group_indicator Group
0 2019-01-01 1 -10 NaN False 0
1 2019-01-02 2 -5 NaN False 0
2 2019-01-03 3 0 101.0 False 0
5 2019-01-06 6 15 NaN True 1
6 2019-01-07 7 15 NaN False 1
7 2019-01-08 8 15 102.0 False 1
10 2019-01-11 11 15 NaN True 2
11 2019-01-12 12 15 NaN False 2
12 2019-01-13 13 15 NaN False 2
13 2019-01-14 14 20 112.0 False 2
You can see that a group number is generated for each group, with all entries of the same group sharing the same number. After that, we group by this group number and aggregate the rows with the following code:
df_out = (df2.groupby('Group', as_index=False)
             .agg(**{'agg(colC)': ('colC', 'last'),
                     'avg(colA)': ('colA', 'mean'),
                     'avg(colB)': ('colB', 'mean'),
                     'delta(Date)': ('Date', lambda x: (x.iloc[-1] - x.iloc[0]).days)})
          )
Here, since the `colC` entry we are interested in is the last one within each group, we aggregate `colC` with `'last'`.
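As a quick check (a sketch, assuming the `df2` and `Group` column from above), taking the last `colC` entry per group gives exactly the values we want to keep:
print(df2.groupby('Group')['colC'].last())
# roughly:
# Group
# 0    101.0
# 1    102.0
# 2    112.0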
For the aggregation on `Date` to get `delta(Date)`, we take the last and first entries of `Date` within each group with `x.iloc[-1]` and `x.iloc[0]` (i.e. the last and first entries of the Pandas Series of `Date` within a group) and calculate their difference by subtraction. As this is a time series, the last/first entries correspond to the max/min entries of `Date` within the group.
Hence, the result:
Group agg(colC) avg(colA) avg(colB) delta(Date)
0 0 101.0 2.0 -5.00 2
1 1 102.0 7.0 15.00 2
2 2 112.0 12.5 16.25 3
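As a side note, the same `delta(Date)` could equivalently be computed with `max`/`min` instead of positional indexing, e.g. (a sketch, assuming the same `df2` and `Group` column):
# should give 2, 2 and 3 days for groups 0, 1 and 2, matching the table above
delta = df2.groupby('Group')['Date'].agg(lambda x: (x.max() - x.min()).days)
print(delta)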
Here the result has a `Group` column while our original code version does not. This is because the original version defined a separate Pandas Series `group` for the grouping, rather than a helper column `Group` as in this illustration.
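If you use the illustration version with the helper column but want the same output as the original code, one simple option is to drop the helper column afterwards:
# drop the helper column to match the original output
df_out = df_out.drop(columns='Group')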
Upvotes: 1
Reputation: 23146
Try with `groupby`:
import pandas as pd

# convert Date column to datetime if needed
df["Date"] = pd.to_datetime(df["Date"])

# keep only rows where there aren't consecutive non-null values in colC
df2 = df[~(df["colC"].notnull() & df["colC"].shift().notnull())]

# group by the runs of consecutive null values (plus the following non-null row) and aggregate
output = (df2.groupby(df2["colC"].notnull().shift().cumsum().fillna(0))
             .agg({"colA": "mean",
                   "colB": "mean",
                   "colC": "first",
                   "Date": lambda x: (x.max() - x.min()).days})
             .rename_axis(None)
             .rename(columns={"Date": "Delta"}))
>>> output
colA colB colC Delta
0.0 2.0 -5.0 101.0 2
1.0 6.5 17.5 101.0 1
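For reference, the grouping key used above can be inspected on its own (a sketch, using the `df2` from this answer with the question's 7-row data):
key = df2["colC"].notnull().shift().cumsum().fillna(0)
print(key)
# rows 0, 1 and 2 share one key value and rows 5 and 6 share another,
# which is why the output above has exactly two groups (indexed 0.0 and 1.0)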
Upvotes: 5