Geoffrey Stoel

Reputation: 1320

Distribute value equally across NaN in pandas

I have the following dataframe:

                     var_value
2016-07-01 05:10:00      809.0
2016-07-01 05:15:00        NaN
2016-07-01 05:20:00        NaN
2016-07-01 05:25:00        NaN
2016-07-01 05:30:00        NaN
2016-07-01 05:35:00        NaN
2016-07-01 05:40:00        NaN
2016-07-01 05:45:00        NaN
2016-07-01 05:50:00        NaN
2016-07-01 05:55:00        NaN
2016-07-01 06:00:00        NaN
2016-07-01 06:05:00        NaN
2016-07-01 06:10:00      185.0
2016-07-01 06:15:00        NaN
2016-07-01 06:20:00        NaN
2016-07-01 06:25:00        NaN
2016-07-01 06:30:00        NaN
2016-07-01 06:35:00        NaN
2016-07-01 06:40:00        NaN
2016-07-01 06:45:00        NaN
2016-07-01 06:50:00        NaN
2016-07-01 06:55:00        NaN
2016-07-01 07:00:00        NaN
2016-07-01 07:05:00        NaN

I want to distribute the 809.0 and 185.0 evenly across the rows. So my resulting dataframe should look like:

               var_value
7/1/2016 5:10    67.42 
7/1/2016 5:15    67.42 
7/1/2016 5:20    67.42 
7/1/2016 5:25    67.42 
7/1/2016 5:30    67.42 
7/1/2016 5:35    67.42 
7/1/2016 5:40    67.42 
7/1/2016 5:45    67.42 
7/1/2016 5:50    67.42 
7/1/2016 5:55    67.42 
7/1/2016 6:00    67.42 
7/1/2016 6:05    67.42 
7/1/2016 6:10    15.42 
7/1/2016 6:15    15.42 
7/1/2016 6:20    15.42 
7/1/2016 6:25    15.42 
7/1/2016 6:30    15.42 
7/1/2016 6:35    15.42 
7/1/2016 6:40    15.42 
7/1/2016 6:45    15.42 
7/1/2016 6:50    15.42 
7/1/2016 6:55    15.42 
7/1/2016 7:00    15.42 
7/1/2016 7:05    15.42 

The number of rows between the known values (the NaNs in this case) can vary. Here it happens to be 11 unknowns, but it could just as well be 10, 3, 7, etc.

Any help on solving this would be very much appreciated.

Upvotes: 2

Views: 1337

Answers (2)

jezrael

Reputation: 862761

You can first forward-fill the NaN values with ffill, then divide each value by its group size using GroupBy.transform:

df['var_value'] = df.var_value.ffill()
df['var_value'] = df['var_value'] / df.groupby('var_value')['var_value'].transform(len)

print (df)
                     var_value
2016-07-01 05:10:00  67.416667
2016-07-01 05:15:00  67.416667
2016-07-01 05:20:00  67.416667
2016-07-01 05:25:00  67.416667
2016-07-01 05:30:00  67.416667
2016-07-01 05:35:00  67.416667
2016-07-01 05:40:00  67.416667
2016-07-01 05:45:00  67.416667
2016-07-01 05:50:00  67.416667
2016-07-01 05:55:00  67.416667
2016-07-01 06:00:00  67.416667
2016-07-01 06:05:00  67.416667
2016-07-01 06:10:00  15.416667
2016-07-01 06:15:00  15.416667
2016-07-01 06:20:00  15.416667
2016-07-01 06:25:00  15.416667
2016-07-01 06:30:00  15.416667
2016-07-01 06:35:00  15.416667
2016-07-01 06:40:00  15.416667
2016-07-01 06:45:00  15.416667
2016-07-01 06:50:00  15.416667
2016-07-01 06:55:00  15.416667
2016-07-01 07:00:00  15.416667
2016-07-01 07:05:00  15.416667
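One detail to keep in mind: the group key above is the filled value itself, so this assumes two separate runs never start with the same value. A minimal sketch (with made-up data, not the question's full frame) that keys on run boundaries instead, so equal values in different runs stay in separate groups:

```python
import numpy as np
import pandas as pd

# Illustrative data: two runs of three rows each.
idx = pd.date_range('2016-07-01 05:10', periods=6, freq='5min')
df = pd.DataFrame(
    {'var_value': [809.0, np.nan, np.nan, 185.0, np.nan, np.nan]},
    index=idx)

# notna().cumsum() increments at every known value, giving one
# label per run: 1,1,1,2,2,2 -- independent of the values themselves.
runs = df['var_value'].notna().cumsum()
filled = df['var_value'].ffill()
df['var_value'] = filled / filled.groupby(runs).transform('size')
print(df)
```

Each run of three rows gets a third of its starting value (809/3 and 185/3), and the column still sums to 809 + 185.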

Comparing solutions:

len(df)=24:

In [18]: %timeit (jez(df))
1000 loops, best of 3: 1.18 ms per loop

In [19]: %timeit (pir(df1))
100 loops, best of 3: 2.92 ms per loop

len(df)=24k:

In [21]: %timeit (jez(df))
100 loops, best of 3: 7.49 ms per loop

In [22]: %timeit (pir(df1))
1 loop, best of 3: 590 ms per loop

Code for timings:

# to compare on 24k rows, uncomment:
#df = pd.concat([df]*1000).reset_index(drop=True)
df1 = df.copy()
def jez(df):
    df['var_value'] = df.var_value.ffill()
    df['var_value'] = df['var_value'] / df.groupby('var_value')['var_value'].transform(len)
    return df    

def pir(df):
    df = df.fillna(0).groupby(df.var_value.notnull().cumsum()).transform(lambda x: x.mean())
    return df    


print (jez(df))
print (pir(df1))

Upvotes: 3

piRSquared

Reputation: 294328

df.fillna(0).groupby(df.notnull().cumsum()).transform(lambda x: x.mean())

2016-07-01 05:10:00    67.416667
2016-07-01 05:15:00    67.416667
2016-07-01 05:20:00    67.416667
2016-07-01 05:25:00    67.416667
2016-07-01 05:30:00    67.416667
2016-07-01 05:35:00    67.416667
2016-07-01 05:40:00    67.416667
2016-07-01 05:45:00    67.416667
2016-07-01 05:50:00    67.416667
2016-07-01 05:55:00    67.416667
2016-07-01 06:00:00    67.416667
2016-07-01 06:05:00    67.416667
2016-07-01 06:10:00    15.416667
2016-07-01 06:15:00    15.416667
2016-07-01 06:20:00    15.416667
2016-07-01 06:25:00    15.416667
2016-07-01 06:30:00    15.416667
2016-07-01 06:35:00    15.416667
2016-07-01 06:40:00    15.416667
2016-07-01 06:45:00    15.416667
2016-07-01 06:50:00    15.416667
2016-07-01 06:55:00    15.416667
2016-07-01 07:00:00    15.416667
2016-07-01 07:05:00    15.416667
Name: var_value, dtype: float64

Explanation

  • df.notnull().cumsum() creates a series that I can groupby with

  • df.fillna(0) ensures that the NaNs are included as 0 when I calculate the mean

  • transform(lambda x: x.mean()) replaces every element in a group with that group's mean.

df.notnull().cumsum()

2016-07-01 05:10:00    1
2016-07-01 05:15:00    1
2016-07-01 05:20:00    1
2016-07-01 05:25:00    1
2016-07-01 05:30:00    1
2016-07-01 05:35:00    1
2016-07-01 05:40:00    1
2016-07-01 05:45:00    1
2016-07-01 05:50:00    1
2016-07-01 05:55:00    1
2016-07-01 06:00:00    1
2016-07-01 06:05:00    1
2016-07-01 06:10:00    2
2016-07-01 06:15:00    2
2016-07-01 06:20:00    2
2016-07-01 06:25:00    2
2016-07-01 06:30:00    2
2016-07-01 06:35:00    2
2016-07-01 06:40:00    2
2016-07-01 06:45:00    2
2016-07-01 06:50:00    2
2016-07-01 06:55:00    2
2016-07-01 07:00:00    2
2016-07-01 07:05:00    2
Name: var_value, dtype: int64
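The steps above can be put together in a self-contained sketch (using a small made-up series, not the question's full frame):

```python
import numpy as np
import pandas as pd

# Illustrative series: one run of four rows, one run of two.
s = pd.Series(
    [809.0, np.nan, np.nan, np.nan, 185.0, np.nan],
    index=pd.date_range('2016-07-01 05:10', periods=6, freq='5min'),
    name='var_value')

# notnull().cumsum() labels each run: 1,1,1,1,2,2.
key = s.notnull().cumsum()

# fillna(0) counts the NaNs as 0, so each group's mean is
# (known value) / (group size).
out = s.fillna(0).groupby(key).transform(lambda x: x.mean())
print(out)
```

Here the first run gets 809/4 and the second 185/2, so the series still sums to 809 + 185.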

Upvotes: 3
