Mike

Reputation: 155

Get highest value in row, but only under specific condition from previous rows

I have this dataframe:

                        1     2     3
datetime                             
2021-01-23 00:01:00  10.0  11.0  15.0
2021-01-23 00:02:00  12.0  10.0   NaN
2021-01-23 00:03:00  16.0  12.0   NaN

(In reality, there are more columns like 1-3...)

I want to add two columns like this:

                        1     2     3  new_max  new_max_col
datetime                             
2021-01-23 00:01:00  10.0  11.0  15.0     15.0            3
2021-01-23 00:02:00  12.0  10.0   NaN      NaN          NaN
2021-01-23 00:03:00  16.0  12.0   NaN     16.0            1

Actions needed:

My problem: If I simply use .fillna(method='ffill') before the comparison, I get 15.0 in new_max for the second row, which is wrong. Only "new" high values should be copied to new_max; the forward-filled values should be used for comparison only.

Code to create the original dataframe:

import io, pandas as pd, numpy as np

t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:01:00|10|11|15
2021-01-23 00:02:00|12|10|NaN
2021-01-23 00:03:00|16|12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime'], dtype=np.float64).set_index('datetime')

!!! EDIT !!!

Here is more and different testing data to describe the problem:

                        1     2     3
datetime                             
2021-01-23 00:00:00  20.0   NaN  15.0
2021-01-23 00:01:00  12.0   NaN   NaN
2021-01-23 00:02:00  10.0  17.0   NaN
2021-01-23 00:03:00   NaN  14.0  18.0
2021-01-23 00:04:00  16.0  12.0   NaN

Note that there could be more than 2 consecutive NaN in a column (not in this example).

Result should be:

                        1     2     3 new_max new_max_col
datetime                                                 
2021-01-23 00:00:00  20.0   NaN  15.0    20.0           1
2021-01-23 00:01:00  12.0   NaN   NaN     NaN         NaN
2021-01-23 00:02:00  10.0  17.0   NaN    17.0           2
2021-01-23 00:03:00   NaN  14.0  18.0    18.0           3
2021-01-23 00:04:00  16.0  12.0   NaN     NaN         NaN

- Row 0: 20 is the highest.

- Row 1: Compare 12 and 15 (15 is lookback from first row in column "3"). 15 is the highest, but it is not "new" (it comes from looking back to the last number before NaNs in the column), so the result is NaN.

- Row 2: Compare 10, 17, 15 (15 is lookback). 17 is the highest.

- Row 3: Compare 10 (lookback), 14, 18. 18 is the highest.

- Row 4: Compare 16, 12, 18 (lookback). 18 is the highest but it's from lookback, so result is NaN.
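The comparison values in the walkthrough above can be reproduced with a forward fill. This is only a sketch of the rule as described, not a solution: the frame is rebuilt by hand so the snippet is self-contained, and the names `filled` and `is_lookback` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the second test frame (column labels are strings, as read_csv would give).
idx = pd.date_range("2021-01-23 00:00:00", periods=5, freq="min", name="datetime")
df = pd.DataFrame(
    {
        "1": [20, 12, 10, np.nan, 16],
        "2": [np.nan, np.nan, 17, 14, 12],
        "3": [15, np.nan, np.nan, 18, np.nan],
    },
    index=idx,
    dtype="float64",
)

# Values used for comparison: each NaN is replaced by the last number above it.
filled = df.ffill()

# True where the comparison value is a lookback rather than a fresh observation.
is_lookback = df.isna() & filled.notna()

print(filled)
print(is_lookback)
```

A row should get a new_max only when its largest value in `filled` sits in a cell where `is_lookback` is False.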

Code to generate new testing dataframe:

import io, pandas as pd, numpy as np

t = io.StringIO("""
datetime|1|2|3
2021-01-23 00:00:00| 20|NaN| 15
2021-01-23 00:01:00| 12|NaN|NaN
2021-01-23 00:02:00| 10| 17|NaN
2021-01-23 00:03:00|NaN| 14| 18
2021-01-23 00:04:00| 16| 12|NaN""")
df = pd.read_csv(t, sep='|', parse_dates=['datetime'], dtype=np.float64).set_index('datetime')

Upvotes: 2

Views: 214

Answers (1)

jezrael

Reputation: 863681

If the first row has no missing values, it is possible to compare the row maximum with its Series.cummax and, for rows that contain missing values, set the result to NaN where the two differ:

df1=df.agg(['max','idxmax'],axis=1).rename(columns={'idxmax':'max_col'}).add_prefix('new_')

mask = df1['new_max'].cummax().ne(df1['new_max']) & df.isna().any(axis=1)

df1 = df.join(df1.mask(mask))
print (df1)
                        1     2     3 new_max new_max_col
datetime                                                 
2021-01-23 00:01:00  10.0  11.0  15.0    15.0           3
2021-01-23 00:02:00  12.0  10.0   NaN     NaN         NaN
2021-01-23 00:03:00  16.0  12.0   NaN    16.0           1

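EDIT: For the new data, forward fill the missing values first, then mask the rows whose maximum comes from a forward-filled value: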

#forward fill NaNs
df11 = df.ffill()
df1 = (df11.agg(['max','idxmax'],axis=1)
           .rename(columns={'idxmax':'max_col'})
           .add_prefix('new_'))

#compare only ffill values by new_max
mask = df11.where(df.isna()).max(axis=1).eq(df1['new_max'])

df1 = df.join(df1.mask(mask))
print (df1)
                        1     2     3 new_max new_max_col
datetime                                                 
2021-01-23 00:00:00  20.0   NaN  15.0    20.0           1
2021-01-23 00:01:00  12.0   NaN   NaN     NaN         NaN
2021-01-23 00:02:00  10.0  17.0   NaN    17.0           2
2021-01-23 00:03:00   NaN  14.0  18.0    18.0           3
2021-01-23 00:04:00  16.0  12.0   NaN     NaN         NaN
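To see why the mask picks rows 1 and 4, it may help to print the intermediate steps. The sketch below splits the one-line mask from the code above into parts; the names `lookback_only` and `lookback_max` are illustrative and not part of the answer, and the frame is rebuilt by hand.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-23 00:00:00", periods=5, freq="min", name="datetime")
df = pd.DataFrame(
    {
        "1": [20, 12, 10, np.nan, 16],
        "2": [np.nan, np.nan, 17, 14, 12],
        "3": [15, np.nan, np.nan, 18, np.nan],
    },
    index=idx,
    dtype="float64",
)

filled = df.ffill()
new_max = filled.max(axis=1)

# Keep only the cells that were NaN in the original, i.e. the forward-filled values.
lookback_only = filled.where(df.isna())

# Largest forward-filled value per row (NaN if the row has none).
lookback_max = lookback_only.max(axis=1)

# A row is masked when its overall maximum equals its largest forward-filled value.
mask = lookback_max.eq(new_max)

print(pd.DataFrame({"new_max": new_max, "lookback_max": lookback_max, "mask": mask}))
```

Row 0 has no forward-filled value, so its `lookback_max` is NaN and the comparison is False, which leaves the row unmasked.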

EDIT1: If you also need to compare against the original max values, add another mask:

print (df)
                        1     2     3
datetime                             
2021-01-23 00:00:00  20.0   NaN  15.0
2021-01-23 00:01:00  12.0   NaN   NaN
2021-01-23 00:02:00  10.0  17.0   NaN
2021-01-23 00:03:00   NaN  14.0  18.0
2021-01-23 00:04:00  18.0  12.0   NaN <- changed value to 18

#forward fill NaNs
df11 = df.ffill()
df1 = (df11.agg(['max','idxmax'],axis=1)
           .rename(columns={'idxmax':'max_col'})
           .add_prefix('new_'))

#compare only ffill values by new_max
mask = df11.where(df.isna()).max(axis=1).eq(df1['new_max'])
mask1 = df1['new_max'].ne(df.max(axis=1))
df1 = df.join(df1.mask(mask & mask1))
print (df1)
                        1     2     3 new_max new_max_col
datetime                                                 
2021-01-23 00:00:00  20.0   NaN  15.0    20.0           1
2021-01-23 00:01:00  12.0   NaN   NaN     NaN         NaN
2021-01-23 00:02:00  10.0  17.0   NaN    17.0           2
2021-01-23 00:03:00   NaN  14.0  18.0    18.0           3
2021-01-23 00:04:00  18.0  12.0   NaN    18.0           1
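A shorter formulation that should give the same table on this data: a row's maximum is "new" exactly when the maximum of the original row equals the maximum of the forward-filled row. This is a sketch under that assumption and not the code from the answer; `is_new` is an illustrative name. One deliberate difference is that the column label is looked up in the original frame, so a tie between a fresh value and a forward-filled one is attributed to the fresh value.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the EDIT1 frame (last row of column "1" changed to 18).
idx = pd.date_range("2021-01-23 00:00:00", periods=5, freq="min", name="datetime")
df = pd.DataFrame(
    {
        "1": [20, 12, 10, np.nan, 18],
        "2": [np.nan, np.nan, 17, 14, 12],
        "3": [15, np.nan, np.nan, 18, np.nan],
    },
    index=idx,
    dtype="float64",
)

filled = df.ffill()
new_max = filled.max(axis=1)

# True where the overall maximum is also present among the row's original values.
is_new = df.max(axis=1).eq(new_max)

# Look up the column only on rows that qualify, then realign to the full index.
new_max_col = df.loc[is_new].idxmax(axis=1).reindex(df.index)

result = df.assign(new_max=new_max.where(is_new), new_max_col=new_max_col)
print(result)
```

Restricting `idxmax` to the rows where `is_new` is True also avoids calling it on rows that contain only NaN.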

Upvotes: 4
