Reputation: 77
I have a time series problem and I want to aggregate some data based on the values that appear in a certain column. To illustrate, consider the following table:
Date | colA | colB | colC |
---|---|---|---|
2019-01-01 | 1 | -10 | Null |
2019-01-02 | 2 | -5 | Null |
2019-01-03 | 3 | 0 | 101 |
2019-01-04 | 4 | 5 | 101 |
2019-01-05 | 5 | 10 | 101 |
2019-01-06 | 6 | 15 | Null |
2019-01-07 | 7 | 20 | 101 |
I want to accomplish the following: group each run of consecutive Null values in colC together with the first non-Null row that follows it, and for each group take that colC value, the average of colA, the average of colB, and the number of days between the first and last Date of the group. Rows whose colC is non-Null and whose previous row's colC is also non-Null should be disregarded.
For the previous table, the result would be
agg(colC) | avg(colA) | avg(colB) | delta(Date) [in days] |
---|---|---|---|
101 | 2 | -5 | 2 |
101 | 6.5 | 17.5 | 1 |
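To make the expected numbers concrete, here is a quick by-hand check of the first output row (plain Python, just to illustrate the grouping I have in mind):
# First group: 2019-01-01 .. 2019-01-03 (the two Null rows plus the first row with 101)
col_a = [1, 2, 3]
col_b = [-10, -5, 0]
print(sum(col_a) / len(col_a))   # 2.0  -> avg(colA)
print(sum(col_b) / len(col_b))   # -5.0 -> avg(colB)
# delta(Date) = 2019-01-03 - 2019-01-01 = 2 days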
I could not find any way to accomplish that so far.
Upvotes: 4
Views: 1053
Reputation: 23217
You can set groups of `colC` with `cumsum()`, and then group by those groups with `.groupby()`, as follows (assuming the Null values are `NaN` or `None`):
import numpy as np
import pandas as pd

# Convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Disregard rows where colC is not null and the previous row's colC is also not null
df2 = df.loc[df['colC'].isna() | df['colC'].shift().isna()]

# Set grouping for colC: a new group starts at each NaN that follows a non-NaN
group = (df2['colC'].shift(1).notna() & df2['colC'].isna()).cumsum()

df_out = (df2.groupby(group, as_index=False)
             .agg(**{'agg(colC)': ('colC', 'last'),
                     'avg(colA)': ('colA', 'mean'),
                     'avg(colB)': ('colB', 'mean'),
                     'delta(Date)': ('Date', lambda x: (x.iloc[-1] - x.iloc[0]).days)})
          )
Data Input:
data = {'Date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06', '2019-01-07'],
        'colA': [1, 2, 3, 4, 5, 6, 7],
        'colB': [-10, -5, 0, 5, 10, 15, 20],
        'colC': [np.nan, np.nan, 101.0, 101.0, 101.0, np.nan, 101.0]}
df = pd.DataFrame(data)
Date colA colB colC
0 2019-01-01 1 -10 NaN
1 2019-01-02 2 -5 NaN
2 2019-01-03 3 0 101.0
3 2019-01-04 4 5 101.0
4 2019-01-05 5 10 101.0
5 2019-01-06 6 15 NaN
6 2019-01-07 7 20 101.0
Result:
print(df_out)
agg(colC) avg(colA) avg(colB) delta(Date)
0 101.0 2.0 -5.0 2
1 101.0 6.5 17.5 1
Let's add more data for better illustration:
data = {'Date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08', '2019-01-09', '2019-01-10', '2019-01-11', '2019-01-12', '2019-01-13', '2019-01-14', '2019-01-15'],
        'colA': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
        'colB': [-10, -5, 0, 5, 10, 15, 15, 15, 20, 20, 15, 15, 15, 20, 20],
        'colC': [np.nan, np.nan, 101, 101, 101, np.nan, np.nan, 102, 103, 104, np.nan, np.nan, np.nan, 112, 113]}
df = pd.DataFrame(data)
Date colA colB colC
0 2019-01-01 1 -10 NaN
1 2019-01-02 2 -5 NaN
2 2019-01-03 3 0 101.0
3 2019-01-04 4 5 101.0
4 2019-01-05 5 10 101.0
5 2019-01-06 6 15 NaN
6 2019-01-07 7 15 NaN
7 2019-01-08 8 15 102.0
8 2019-01-09 9 20 103.0
9 2019-01-10 10 20 104.0
10 2019-01-11 11 15 NaN
11 2019-01-12 12 15 NaN
12 2019-01-13 13 15 NaN
13 2019-01-14 14 20 112.0
14 2019-01-15 15 20 113.0
After the date format conversion, we discard the unwanted rows by keeping only the rows whose `colC` is `NaN`, or whose `colC` is not `NaN` but whose previous row's `colC` is `NaN`.
# Convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'])
df2 = df.loc[df['colC'].isna() | df['colC'].shift().isna()]
Result:
print(df2)
Date colA colB colC
0 2019-01-01 1 -10 NaN
1 2019-01-02 2 -5 NaN
2 2019-01-03 3 0 101.0
5 2019-01-06 6 15 NaN
6 2019-01-07 7 15 NaN
7 2019-01-08 8 15 102.0
10 2019-01-11 11 15 NaN
11 2019-01-12 12 15 NaN
12 2019-01-13 13 15 NaN
13 2019-01-14 14 20 112.0
Now, to set up the groups, let's also show the intermediate values in the dataframe by adding helper columns, for illustration purposes:
df2['Group_indicator'] = df2['colC'].shift(1).notna() & df2['colC'].isna()
This sets a boolean mask so that the first entry of each group is `True` and the other entries are `False` (the very first group is an exception, since its first row stays `False`, but that causes no problem when we generate the group numbers).
Result:
Date colA colB colC Group_indicator
0 2019-01-01 1 -10 NaN False
1 2019-01-02 2 -5 NaN False
2 2019-01-03 3 0 101.0 False
5 2019-01-06 6 15 NaN True
6 2019-01-07 7 15 NaN False
7 2019-01-08 8 15 102.0 False
10 2019-01-11 11 15 NaN True
11 2019-01-12 12 15 NaN False
12 2019-01-13 13 15 NaN False
13 2019-01-14 14 20 112.0 False
Then, we generate the group numbers:
df2['Group'] = df2['Group_indicator'].cumsum()
Result:
Date colA colB colC Group_indicator Group
0 2019-01-01 1 -10 NaN False 0
1 2019-01-02 2 -5 NaN False 0
2 2019-01-03 3 0 101.0 False 0
5 2019-01-06 6 15 NaN True 1
6 2019-01-07 7 15 NaN False 1
7 2019-01-08 8 15 102.0 False 1
10 2019-01-11 11 15 NaN True 2
11 2019-01-12 12 15 NaN False 2
12 2019-01-13 13 15 NaN False 2
13 2019-01-14 14 20 112.0 False 2
You can see that a group number is generated for each group, with all entries of the same group sharing the same number. After that, we group by this group number and aggregate the rows with the following code:
df_out = (df2.groupby('Group', as_index=False)
             .agg(**{'agg(colC)': ('colC', 'last'),
                     'avg(colA)': ('colA', 'mean'),
                     'avg(colB)': ('colB', 'mean'),
                     'delta(Date)': ('Date', lambda x: (x.iloc[-1] - x.iloc[0]).days)})
          )
Here, since the `colC` entry we are interested in is the last one within each group, we aggregate `colC` with `'last'`.
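As a quick check (a sketch, assuming the `df2` and `Group` column from above), taking the last `colC` entry per group gives exactly the values we want to keep:
print(df2.groupby('Group')['colC'].last())
# roughly:
# Group
# 0    101.0
# 1    102.0
# 2    112.0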
For the aggregation on `Date` to get `delta(Date)`, we take the last and first entries of `Date` within each group with `x.iloc[-1]` and `x.iloc[0]` (i.e. the last and first entries of the Pandas Series of `Date` within a group) and calculate their difference by subtraction. As this is a time series, the last/first entries correspond to the max/min entries of `Date` within the group.
Hence, the result:
Group agg(colC) avg(colA) avg(colB) delta(Date)
0 0 101.0 2.0 -5.00 2
1 1 102.0 7.0 15.00 2
2 2 112.0 12.5 16.25 3
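As a side note, the same `delta(Date)` could equivalently be computed with `max`/`min` instead of positional indexing, e.g. (a sketch, assuming the same `df2` and `Group` column):
# should give 2, 2 and 3 days for groups 0, 1 and 2, matching the table above
delta = df2.groupby('Group')['Date'].agg(lambda x: (x.max() - x.min()).days)
print(delta)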
Here the result has a `Group` column while our original code version does not. This is because the original version defined a separate Pandas Series `group` for the grouping, rather than a helper column `Group` as in this illustration.
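If you use the illustration version with the helper column but want the same output as the original code, one simple option is to drop the helper column afterwards:
# drop the helper column to match the original output
df_out = df_out.drop(columns='Group')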
Upvotes: 1
Reputation: 23146
Try with `groupby`:
import pandas as pd

# convert Date column to datetime if needed
df["Date"] = pd.to_datetime(df["Date"])

# keep only rows where there aren't consecutive non-null values in colC
df2 = df[~(df["colC"].notnull() & df["colC"].shift().notnull())]

# group by the runs of consecutive null values (plus the following non-null row) and aggregate
output = (df2.groupby(df2["colC"].notnull().shift().cumsum().fillna(0))
             .agg({"colA": "mean",
                   "colB": "mean",
                   "colC": "first",
                   "Date": lambda x: (x.max() - x.min()).days})
             .rename_axis(None)
             .rename(columns={"Date": "Delta"}))
>>> output
colA colB colC Delta
0.0 2.0 -5.0 101.0 2
1.0 6.5 17.5 101.0 1
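For reference, the grouping key used above can be inspected on its own (a sketch, using the `df2` from this answer with the question's 7-row data):
key = df2["colC"].notnull().shift().cumsum().fillna(0)
print(key)
# rows 0, 1 and 2 share one key value and rows 5 and 6 share another,
# which is why the output above has exactly two groups (indexed 0.0 and 1.0)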
Upvotes: 5