Reputation: 13
I will summarize with a small example of what I am trying to do. Let's say we have a dataframe, with two of its roughly 15 columns shown below:
change period
0 -1 1
1 -1 1
2 0.0 1
3 -1 1
4 1 2
5 1 2
6 0.0 2
7 0.0 2
8 1 2
9 -1 3
...
...
And this extends for around 25M data entries. Essentially, I want to change each 0.0 under the change column in the dataframe to take the value of the direction within its period (so either -1 or +1, which represent the directions), excluding the first entry in the period.
This is what I am currently running, but with so many data entries it takes hours, which I can't afford:
def getPeriodDirection(period):
    val = df.loc[(df['period'] == period) & (df['change'] != 0.0), 'change'].median()
    return val
df['change'] = df.apply(lambda row: getPeriodDirection(row['period']) if row['change'] == 0.0 else row['change'], axis=1)
I tried several things using .loc, but I just can't get it exactly how I need it. I tried the following:
directionNoChange = df['change'].isin(range(0,1))
df.loc[directionNoChange, 'change'] = getPeriodDirection(df, df['period'])
This solution got me pretty close. I ended up with a dataframe that keeps the original indexes of the rows where 'change' is 0.0, updated with the correct value from the function. Based on the example, it would generate:
change
2 -1
6 1
7 1
The next step would have been to substitute these values from the loc dataframe into the original dataframe where the indexes line up. But since I am fairly unfamiliar with the API, I am having a lot of trouble!
Help would be highly appreciated!
Upvotes: 0
Views: 58
Reputation: 3295
If you want to replace the 0.0 values with the median of the group, you can use .mask to convert the 0.0 values to NaN, then fill them with the median.
print(df)
change period
0 -1.0 1
1 -1.0 1
2 0.0 1
3 -1.0 1
4 1.0 2
5 1.0 2
6 0.0 2
7 0.0 2
8 1.0 2
9 -1.0 3
# mask takes a condition and replaces the values where it is True with NaN
print(df.change.mask(cond = df.change == 0))
0 -1.0
1 -1.0
2 NaN
3 -1.0
4 1.0
5 1.0
6 NaN
7 NaN
8 1.0
9 -1.0
Name: change, dtype: float64
# pass the other parameter to fill those positions with a value instead, similar to fillna
df['change'] = df.change.mask(cond = df.change == 0, other = df.groupby('period').change.transform('median'))
print(df)
change period
0 -1.0 1
1 -1.0 1
2 -1.0 1
3 -1.0 1
4 1.0 2
5 1.0 2
6 1.0 2
7 1.0 2
8 1.0 2
9 -1.0 3
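One caveat: transform('median') here is computed over every row in a period, zeros included, so a period where zeros outnumber the non-zero rows would get a median of 0 and its zeros would stay unchanged. A minimal sketch that takes the median over the non-zero rows only, as getPeriodDirection in the question does, might look like the following. The names is_zero and direction are illustrative, and the frame is rebuilt from the sample data so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "change": [-1.0, -1.0, 0.0, -1.0, 1.0, 1.0, 0.0, 0.0, 1.0, -1.0],
    "period": [1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
})

is_zero = df["change"] == 0

# Hide the zeros as NaN first. The groupby median skips NaN,
# so each period's median is taken over its non-zero rows only.
direction = (
    df["change"].mask(is_zero)
    .groupby(df["period"])
    .transform("median")
)

# Replace only the zero rows; every other row keeps its original value.
df["change"] = df["change"].mask(is_zero, direction)
print(df)
```

If a period contains nothing but zeros, direction is NaN for that period and those rows end up as NaN, so that case would need separate handling.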
Upvotes: 1