Rui

Reputation: 13

Pandas: Updating column values efficiently based on a function

I will summarize what I am trying to do with a small example. Let's say we have a dataframe, and two of its roughly 15 columns are shown below:


    change  period 
0    -1       1
1    -1       1
2    0.0      1
3    -1       1
4     1       2
5     1       2
6    0.0      2
7    0.0      2
8     1       2
9    -1       3

...
...

This extends to around 25M rows. Essentially, I want to replace each 0.0 in the change column with the direction of its period (either -1 or +1), excluding the first entry in the period.

This is what I am currently running, but with so many rows it takes hours, which I can't afford:

def getPeriodDirection(period):
    val = df.loc[(df['period'] == period) & (df['change'] != 0.0) , 'change'].median()
    return val


df['change'] = df.apply(lambda row : getPeriodDirection(row['period']) if row['change'] == 0.0 else row['change'] , axis=1)

I tried several approaches using .loc, but I can't get it to do exactly what I need. I tried the following:

directionNoChange = df['change'].isin(range(0,1))

df.loc[directionNoChange, 'change'] = getPeriodDirection(df, df['period'])

This got me pretty close. I ended up with a dataframe that keeps the original indexes of the rows where 'change' is 0.0, updated with the correct value from the function. For the example above, it would produce:

  change
2   -1
6    1
7    1

The next step would have been to substitute these values back into the original dataframe where the indexes line up. But since I am fairly unfamiliar with the API, I am having a lot of trouble!

Help would be highly appreciated!

Upvotes: 0

Views: 58

Answers (1)

Ian Thompson

Reputation: 3295

If you want to replace the 0.0 values with the median of their group, you can use .mask to convert the 0.0 values to NaN, then fill them with the group median.

print(df)

   change  period
0    -1.0       1
1    -1.0       1
2     0.0       1
3    -1.0       1
4     1.0       2
5     1.0       2
6     0.0       2
7     0.0       2
8     1.0       2
9    -1.0       3

# mask takes a condition and fills the True values with NaN
print(df.change.mask(cond = df.change == 0))

0   -1.0
1   -1.0
2    NaN
3   -1.0
4    1.0
5    1.0
6    NaN
7    NaN
8    1.0
9   -1.0
Name: change, dtype: float64

# the `other` parameter supplies the replacement values, much like fillna
df['change'] = df.change.mask(cond = df.change == 0, other = df.groupby('period').change.transform('median'))

print(df)

   change  period
0    -1.0       1
1    -1.0       1
2    -1.0       1
3    -1.0       1
4     1.0       2
5     1.0       2
6     1.0       2
7     1.0       2
8     1.0       2
9    -1.0       3
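
One caveat worth noting: the transform('median') call above takes the median over every row in the period, zeros included, so a period where 0.0 is the most common value could come out with a median of 0.0. The function in the question excludes the zeros before taking the median. Below is a minimal sketch of that variant, assuming the same two columns as the example; the names is_zero, nonzero and direction are only illustrative. It also writes the result back with .loc, relying on index alignment, which is roughly what the second attempt in the question was reaching for.

```python
import pandas as pd

# Sample data copied from the example in the question.
df = pd.DataFrame({
    "change": [-1.0, -1.0, 0.0, -1.0, 1.0, 1.0, 0.0, 0.0, 1.0, -1.0],
    "period": [1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
})

# Boolean Series marking the rows that need a direction filled in.
is_zero = df["change"].eq(0)

# Turn the zeros into NaN first so they do not take part in the median.
nonzero = df["change"].mask(is_zero)

# Per-period median of the non-zero values, broadcast back to every row.
# The median skips NaN by default.
direction = nonzero.groupby(df["period"]).transform("median")

# Assign only to the zero rows; pandas aligns the right-hand side on the index.
df.loc[is_zero, "change"] = direction[is_zero]

print(df)
```

If a period contains nothing but zeros, its median is NaN and those rows end up as NaN, so that case would need a separate rule. This sketch also does not treat the first entry of a period differently, since the question does not spell out how it should be handled.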

Upvotes: 1
