Reputation: 1320
I have the following dataframe:
var_value
2016-07-01 05:10:00 809.0
2016-07-01 05:15:00 NaN
2016-07-01 05:20:00 NaN
2016-07-01 05:25:00 NaN
2016-07-01 05:30:00 NaN
2016-07-01 05:35:00 NaN
2016-07-01 05:40:00 NaN
2016-07-01 05:45:00 NaN
2016-07-01 05:50:00 NaN
2016-07-01 05:55:00 NaN
2016-07-01 06:00:00 NaN
2016-07-01 06:05:00 NaN
2016-07-01 06:10:00 185.0
2016-07-01 06:15:00 NaN
2016-07-01 06:20:00 NaN
2016-07-01 06:25:00 NaN
2016-07-01 06:30:00 NaN
2016-07-01 06:35:00 NaN
2016-07-01 06:40:00 NaN
2016-07-01 06:45:00 NaN
2016-07-01 06:50:00 NaN
2016-07-01 06:55:00 NaN
2016-07-01 07:00:00 NaN
2016-07-01 07:05:00 NaN
I want to distribute the 809.0 and the 185.0 evenly across their respective blocks of rows (each known value plus the NaNs that follow it), so my resulting dataframe should look like:
var_value
7/1/2016 5:10 67.42
7/1/2016 5:15 67.42
7/1/2016 5:20 67.42
7/1/2016 5:25 67.42
7/1/2016 5:30 67.42
7/1/2016 5:35 67.42
7/1/2016 5:40 67.42
7/1/2016 5:45 67.42
7/1/2016 5:50 67.42
7/1/2016 5:55 67.42
7/1/2016 6:00 67.42
7/1/2016 6:05 67.42
7/1/2016 6:10 15.42
7/1/2016 6:15 15.42
7/1/2016 6:20 15.42
7/1/2016 6:25 15.42
7/1/2016 6:30 15.42
7/1/2016 6:35 15.42
7/1/2016 6:40 15.42
7/1/2016 6:45 15.42
7/1/2016 6:50 15.42
7/1/2016 6:55 15.42
7/1/2016 7:00 15.42
7/1/2016 7:05 15.42
The number of rows between the known values (the NaNs that need to be filled) can vary. Here each block conveniently has 11 unknowns, but it could just as well be 10 or 3 or 7, etc.
Any help on solving this would be very much appreciated.
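For reference, a minimal snippet to rebuild this sample data (assuming the 5-minute DatetimeIndex shown above):
import numpy as np
import pandas as pd

# two known values, each followed by 11 NaNs, on a 5-minute index
idx = pd.date_range('2016-07-01 05:10', periods=24, freq='5min')
df = pd.DataFrame({'var_value': [809.0] + [np.nan] * 11 + [185.0] + [np.nan] * 11},
                  index=idx)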
Upvotes: 2
Views: 1337
Reputation: 862761
You can first ffill the NaN values, then divide each value by the size of its group of identical filled values using GroupBy.transform with len:
# forward-fill each known value across the NaNs that follow it
df['var_value'] = df.var_value.ffill()
# divide every value by the number of rows in its block
df['var_value'] = df['var_value'] / df.groupby('var_value')['var_value'].transform(len)
print (df)
var_value
2016-07-01 05:10:00 67.416667
2016-07-01 05:15:00 67.416667
2016-07-01 05:20:00 67.416667
2016-07-01 05:25:00 67.416667
2016-07-01 05:30:00 67.416667
2016-07-01 05:35:00 67.416667
2016-07-01 05:40:00 67.416667
2016-07-01 05:45:00 67.416667
2016-07-01 05:50:00 67.416667
2016-07-01 05:55:00 67.416667
2016-07-01 06:00:00 67.416667
2016-07-01 06:05:00 67.416667
2016-07-01 06:10:00 15.416667
2016-07-01 06:15:00 15.416667
2016-07-01 06:20:00 15.416667
2016-07-01 06:25:00 15.416667
2016-07-01 06:30:00 15.416667
2016-07-01 06:35:00 15.416667
2016-07-01 06:40:00 15.416667
2016-07-01 06:45:00 15.416667
2016-07-01 06:50:00 15.416667
2016-07-01 06:55:00 15.416667
2016-07-01 07:00:00 15.416667
2016-07-01 07:05:00 15.416667
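One caveat: grouping by the filled values would merge two separate blocks that happened to start with the same number. A minimal variant (a sketch, not part of the original answer) avoids this by grouping on a block id instead:
# build a block id from the positions of the original non-null values
blocks = df['var_value'].notnull().cumsum()
df['var_value'] = df['var_value'].ffill()
# divide by the size of each block rather than each value group
df['var_value'] = df['var_value'] / df.groupby(blocks)['var_value'].transform('size')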
Comparing solutions, first with len(df)=24:
In [18]: %timeit (jez(df))
1000 loops, best of 3: 1.18 ms per loop
In [19]: %timeit (pir(df1))
100 loops, best of 3: 2.92 ms per loop
And with len(df)=24k:
In [21]: %timeit (jez(df))
100 loops, best of 3: 7.49 ms per loop
In [22]: %timeit (pir(df1))
1 loop, best of 3: 590 ms per loop
Code for timings:
# if you need to compare the 24k-row case, uncomment:
# df = pd.concat([df]*1000).reset_index(drop=True)
df1 = df.copy()

def jez(df):
    df['var_value'] = df.var_value.ffill()
    df['var_value'] = df['var_value'] / df.groupby('var_value')['var_value'].transform(len)
    return df

def pir(df):
    df = df.fillna(0).groupby(df.var_value.notnull().cumsum()).transform(lambda x: x.mean())
    return df

print (jez(df))
print (pir(df1))
Upvotes: 3
Reputation: 294328
df.fillna(0).groupby(df.notnull().cumsum()).transform(lambda x: x.mean())
2016-07-01 05:10:00 67.416667
2016-07-01 05:15:00 67.416667
2016-07-01 05:20:00 67.416667
2016-07-01 05:25:00 67.416667
2016-07-01 05:30:00 67.416667
2016-07-01 05:35:00 67.416667
2016-07-01 05:40:00 67.416667
2016-07-01 05:45:00 67.416667
2016-07-01 05:50:00 67.416667
2016-07-01 05:55:00 67.416667
2016-07-01 06:00:00 67.416667
2016-07-01 06:05:00 67.416667
2016-07-01 06:10:00 15.416667
2016-07-01 06:15:00 15.416667
2016-07-01 06:20:00 15.416667
2016-07-01 06:25:00 15.416667
2016-07-01 06:30:00 15.416667
2016-07-01 06:35:00 15.416667
2016-07-01 06:40:00 15.416667
2016-07-01 06:45:00 15.416667
2016-07-01 06:50:00 15.416667
2016-07-01 06:55:00 15.416667
2016-07-01 07:00:00 15.416667
2016-07-01 07:05:00 15.416667
Name: var_value, dtype: float64
df.notnull().cumsum() creates a series that I can group by.
df.fillna(0) ensures that the NaNs are included as 0 when I calculate the mean.
transform(lambda x: x.mean()) computes the mean of each group and broadcasts it back to every element of that group.
Here is what df.notnull().cumsum() looks like:
2016-07-01 05:10:00 1
2016-07-01 05:15:00 1
2016-07-01 05:20:00 1
2016-07-01 05:25:00 1
2016-07-01 05:30:00 1
2016-07-01 05:35:00 1
2016-07-01 05:40:00 1
2016-07-01 05:45:00 1
2016-07-01 05:50:00 1
2016-07-01 05:55:00 1
2016-07-01 06:00:00 1
2016-07-01 06:05:00 1
2016-07-01 06:10:00 2
2016-07-01 06:15:00 2
2016-07-01 06:20:00 2
2016-07-01 06:25:00 2
2016-07-01 06:30:00 2
2016-07-01 06:35:00 2
2016-07-01 06:40:00 2
2016-07-01 06:45:00 2
2016-07-01 06:50:00 2
2016-07-01 06:55:00 2
2016-07-01 07:00:00 2
2016-07-01 07:05:00 2
Name: var_value, dtype: int64
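As a quick sanity check (a sketch assuming df holds the var_value column as a Series, as above), each block of the result should sum back to its original value:
result = df.fillna(0).groupby(df.notnull().cumsum()).transform(lambda x: x.mean())
# each block's distributed values sum back to the original totals
print(result.groupby(df.notnull().cumsum()).sum())
# 1    809.0
# 2    185.0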
Upvotes: 3