Reputation: 307
I'm trying to replace NaN values in a DataFrame by splitting the previous available value evenly between itself and all the NaN values that follow it.
In the example below:
import pandas as pd
df = [100, None, None, 40, None, 120]
df = pd.DataFrame(df)
I would like to get:
[33.33, 33.33, 33.33, 20, 20, 120]
If I could find a way to count the number of NaN values following each value in my column, then I could run some computations to achieve the split.
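For instance, something like this sketch is what I have in mind for the counting step (not sure it's the right approach):
import pandas as pd

df = pd.DataFrame([100, None, None, 40, None, 120])

# Label each run that starts at a non-NaN value, then count
# the NaNs inside each run.
group_ids = df[0].notna().cumsum()
nan_counts = df[0].isna().groupby(group_ids).sum()
print(nan_counts)
# 1    2
# 2    1
# 3    0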
Upvotes: 3
Views: 797
Reputation: 71600
Maybe use ffill, take the first column, and divide it by the counts from the ffilled DataFrame grouped by that column with transform('count'), which gets the counts while keeping the same length as the DataFrame:
print(df.ffill()[0] / df.ffill().groupby(0)[0].transform('count'))
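Note this groups by the filled values themselves, so it assumes each value starts at most one run of NaNs; if the same value appeared before two different runs, their counts would be merged.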
Upvotes: 0
Reputation: 863166
Use:
import pandas as pd
df = [100, None, None, 40, None, 120]
df = pd.DataFrame(df, columns=['a'])
s = df['a'].ffill() / df.groupby(df['a'].notna().cumsum())['a'].transform('size')
print (s)
0 33.333333
1 33.333333
2 33.333333
3 20.000000
4 20.000000
5 120.000000
Name: a, dtype: float64
Details:
You can replace missing values with the previous non-NaN values using ffill:
print (df['a'].ffill())
0 100.0
1 100.0
2 100.0
3 40.0
4 40.0
5 120.0
Name: a, dtype: float64
Then mark non-missing values with Series.notna and create group identifiers with Series.cumsum:
print (df['a'].notna().cumsum())
0 1
1 1
2 1
3 2
4 2
5 3
Name: a, dtype: int32
And get the size of each group, broadcast back to the original length, with GroupBy.transform:
print (df.groupby(df['a'].notna().cumsum())['a'].transform('size'))
0 3
1 3
2 3
3 2
4 2
5 1
Name: a, dtype: int64
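A note on why the groups are built from notna().cumsum() rather than from the values themselves: the cumsum IDs keep two runs separate even when they start with the same value. A quick check with a hypothetical input containing a repeated value:
df2 = pd.DataFrame({'a': [100, None, 100, None]})
s2 = df2['a'].ffill() / df2.groupby(df2['a'].notna().cumsum())['a'].transform('size')
print (s2)
# each run of [100, NaN] splits to 50.0, not 25.0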
Upvotes: 5