Reputation: 87
I have a dataframe with 5000 records. I want the null values to be filled with:
Average(all the preceding values before the null, the first succeeding value after the null)
data:
Date gcs Comp Clay WTS
2020-01-01 1550 41 9.41 22.6
2020-01-02 1540 48 9.50 25.8
2020-01-03 NaN NaN NaN NaN
2020-01-04 1542 42 9.30 23.7
2020-01-05 1580 48 9.10 21.2
2020-01-06 NaN NaN NaN NaN
2020-01-07 1520 40 10 20.2
2020-01-08 1523 30 25 19
Example: For the date 2020-01-03, I want the null value in the gcs column to be filled with Average(1550, 1540, 1542), which gives 1544.
1550 and 1540 are the preceding values before the null, and 1542 is the first succeeding value after it.
Similarly,
For the date 2020-01-06, I want the null value in the gcs column to be filled with Average(1550, 1540, 1544, 1542, 1580, 1520), which gives 1546.
1550 through 1580 are the preceding values before the null (including the imputed 1544), and 1520 is the first succeeding value after it.
Desired Output:
Date gcs Comp Clay WTS
2020-01-01 1550 41 9.41 22.6
2020-01-02 1540 48 9.50 25.8
2020-01-03 1544 43.66 9.403 24.03
2020-01-04 1542 42 9.30 23.7
2020-01-05 1580 48 9.10 21.2
2020-01-06 1546 43.77 9.45 22.92
2020-01-07 1520 40 10 20.2
2020-01-08 1523 30 25 19
Edit:
Thanks for the response, Tom. I kept my Date column as the index and tried the code below:
def foo(row):
    if any(row.isna()):
        df.loc[row.name, row.isna()] = df.expanding().mean().shift(-1).loc[row.name, :]

df.apply(foo, axis=1)
The output that I got is:
Date
2020-01-01 None
2020-01-02 None
2020-01-03 None
2020-01-04 None
2020-01-05 None
2020-01-06 None
2020-01-07 None
2020-01-08 None
dtype: object
Can you please help me figure out what is wrong?
Upvotes: 3
Views: 784
Reputation: 8790
The following seems to work. You define an apply function for the rows which modifies the df in place. Each time a row with null values is reached, you take an expanding mean of df (see here), using a shift to include the following row. You then use loc to overwrite df with the new values:
def foo(row):
    if any(row.isna()):
        df.loc[row.name, row.isna()] = df.expanding().mean().shift(-1).loc[row.name, :]
Applying:
>>> df.apply(foo, axis=1)
gcs Comp Clay WTS
Date
2020-01-01 1550.0 41.000000 9.410000 22.600000
2020-01-02 1540.0 48.000000 9.500000 25.800000
2020-01-03 1544.0 43.666667 9.403333 24.033333
2020-01-04 1542.0 42.000000 9.300000 23.700000
2020-01-05 1580.0 48.000000 9.100000 21.200000
2020-01-06 1546.0 43.777778 9.452222 22.922222
2020-01-07 1520.0 40.000000 10.000000 20.200000
2020-01-08 1523.0 30.000000 25.000000 19.000000
Note that I moved your Date column to be an index. I think the above should work wherever the missing values are, ensuring that the values are filled in from top to bottom.
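Putting it all together, a self-contained version might look like this (the DataFrame construction is my own reconstruction of the sample data from the question):

```python
import numpy as np
import pandas as pd

# Reconstruction of the sample data, with Date as the index
df = pd.DataFrame(
    {
        "gcs":  [1550, 1540, np.nan, 1542, 1580, np.nan, 1520, 1523],
        "Comp": [41, 48, np.nan, 42, 48, np.nan, 40, 30],
        "Clay": [9.41, 9.50, np.nan, 9.30, 9.10, np.nan, 10, 25],
        "WTS":  [22.6, 25.8, np.nan, 23.7, 21.2, np.nan, 20.2, 19],
    },
    index=pd.date_range("2020-01-01", periods=8, name="Date"),
)

def foo(row):
    # For rows containing NaNs, fill them from the column-wise expanding
    # mean, shifted up one row so the first value *after* the gap is included.
    if any(row.isna()):
        df.loc[row.name, row.isna()] = df.expanding().mean().shift(-1).loc[row.name, :]

df.apply(foo, axis=1)
print(df)
```

Note that foo returns nothing, so depending on your pandas version the return value of df.apply(foo, axis=1) may display as a Series of None values (as in your edit) rather than the filled frame; the filling happens on df itself, so inspect df after the apply.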
I'm not sure how it will handle scaling up to 5000 rows, but it seems like you have to use apply or some loop because you want to include imputed values in the calculation of future imputed values*. I added the if statement because it seemed to speed up the calculation considerably:
%%timeit
df.apply(foo, axis=1)
#1.17 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.apply(foo_without_if, axis=1)
#16.2 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
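(foo_without_if isn't shown above; presumably it is just foo with the guard removed, i.e. something like the sketch below, using a single-column reconstruction of the sample data:)

```python
import numpy as np
import pandas as pd

# Single-column reconstruction of the sample data
df = pd.DataFrame(
    {"gcs": [1550, 1540, np.nan, 1542, 1580, np.nan, 1520, 1523]},
    index=pd.date_range("2020-01-01", periods=8, name="Date"),
)

def foo_without_if(row):
    # Same as foo, but recomputes the expanding mean for every row,
    # even rows with no missing values (hence the slowdown). For rows
    # without NaNs, row.isna() selects no columns and the assignment
    # is a no-op.
    df.loc[row.name, row.isna()] = df.expanding().mean().shift(-1).loc[row.name, :]

df.apply(foo_without_if, axis=1)
```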
*If you don't want to do this (i.e. you can just take the expanding mean but ignore NAs from earlier rows), you can do:
mask = df.isna()
df[mask] = df.expanding().mean()[mask.shift(1)].shift(-1)
Upvotes: 1