Bogdan Doicin
Bogdan Doicin

Reputation: 2416

Conditional backfill of Pandas columns, another variation

I have the following Dataframe:

Track FGrating HorseId Last FGrating at Happy Valley Grass
Happy Valley grass 97 22609
Happy Valley grass 106 22609 97
Happy Valley grass 104 22609 106
Happy Valley grass 102 22609 104
Happy Valley grass 95 22609 102
Sha Tin grass 108 22609
Sha Tin grass 104 22609
Happy Valley grass 107 22609 95
Sha Tin grass 102 22609
Happy Valley grass 108 22609 107

I need to fill the empty cells of the rightmost column according to these two rules:

  1. If the horse didn't race on the particular track yet (Happy Valley grass, in this example), then the value to be filled is 0;
  2. Between two races at the particular track (Happy Valley grass, in this example), the value to be filled is the last FGrating on the track in question (the two consecutive rows with Sha Tin grass will get the value 95 and the third one will get 107).

The end result will be like this:

Track FGrating HorseId Last FGrating at Happy Valley Grass
Happy Valley grass 97 22609 0 (rule 1)
Happy Valley grass 106 22609 97
Happy Valley grass 104 22609 106
Happy Valley grass 102 22609 104
Happy Valley grass 95 22609 102
Sha Tin grass 108 22609 95 (rule 2)
Sha Tin grass 104 22609 95 (rule 2)
Happy Valley grass 107 22609 95
Sha Tin grass 102 22609 107 (rule 2)
Happy Valley grass 108 22609 107

I need this for every HorseId on the Dataframe

I tried doing a backfill then filling with 0, something like this:

df['Last FGrating at Happy Valley Grass'] = df.groupby('HorseId')['Last FGrating at Happy Valley Grass'].apply(
        lambda x: x.fillna(method='bfill').fillna(0))

but it backfills everything, ignoring rule 1.

Can you help me here? Doing some research here, I have the feeling that it is a bit more complicated than a string of fillna.

Upvotes: 2

Views: 176

Answers (1)

Timus
Timus

Reputation: 11321

Following up on the information from the comments, I'd propose something like:

def fill(sdf, track):
    mask = sdf['Track'].eq(track).cummax()
    if mask.any():
        first = mask[mask].index[0]
        if pd.isna(sdf.at[first, 'Last FGrating at Happy Valley Grass']):
            sdf.loc[~mask, 'Last FGrating at Happy Valley Grass'] = 0
            sdf.at[first, 'Last FGrating at Happy Valley Grass'] = 0 
    sdf['Last FGrating at Happy Valley Grass'] = (
        sdf['Last FGrating at Happy Valley Grass'].fillna(method='bfill')
    )
    return sdf

track = "Sha Tin grass"
df[track] = (
    df.groupby('HorseId').apply(fill, track)['Last FGrating at Happy Valley Grass']
)

Since the operation depends on the track, I've made it an argument for the fill function. I'd also put the result into a new column (here named after the chosen track) to not override the base column.

Variation, which I like better (using numerical access):

def fill(sdf, track):
    col = sdf.columns.to_list().index('Last FGrating at Happy Valley Grass')
    mask = sdf['Track'].eq(track)
    if mask.any():
        first = mask.values.argmax()
        if pd.isna(sdf.iat[first, col]):
            sdf.iloc[:first + 1, col] = 0
    sdf.iloc[:, col] = sdf.iloc[:, col].fillna(method='bfill')
    return sdf

EDIT: Removed the first insufficient attempts, made slight adjustments to current answer, and added a variation.

Upvotes: 3

Related Questions