camphor

Reputation: 71

How can I fill NaN values in a dataframe with the average of the values above it?

I'd like NaN values in a dataframe to be filled with the mean of all the values above that point (including previously filled NaNs), so that this:

  A       
0 1
1 2
2 3
3 4
4 5
5 NaN  
6 NaN
7 11
8 NaN

Would become

 A       
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4

Upvotes: 2

Views: 250

Answers (2)

RichieV

Reputation: 5183

Marco's answer works fine, but it can be optimized with an incremental (running) average formula, as described on math.stackexchange.com.

Here is an adaptation of that idea (not the exact formula, just the concept).

cumsum = 0
expanding_mean = []
for i, xi in enumerate(df['A']):
    if pd.isna(xi):
        mean = cumsum / i  # i == number of items up to the previous row
        expanding_mean.append(mean)
        cumsum += mean
    else:
        cumsum += xi

df.loc[df['A'].isna(), 'A'] = expanding_mean

The main advantage of this code is that it does not re-read all items up to the current index on each iteration to compute the mean.

This option still uses a Python loop, which is not the best choice with pandas, but there seems to be no way around it for this use case (hopefully someone will get inspired by this and find a vectorized method).
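For reference, here is a self-contained sketch of the snippet above applied to the sample data from the question (the frame construction is my addition, not part of the original answer):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({"A": [1, 2, 3, 4, 5, None, None, 11, None]}, dtype=float)

cumsum = 0
expanding_mean = []
for i, xi in enumerate(df["A"]):
    if pd.isna(xi):
        mean = cumsum / i  # i == number of items up to the previous row
        expanding_mean.append(mean)
        cumsum += mean  # filled values count toward later means
    else:
        cumsum += xi

df.loc[df["A"].isna(), "A"] = expanding_mean
print(df["A"].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 3.0, 11.0, 4.0]
```

Note this fragment assumes the first row is not NaN (otherwise the first division is by zero); the full test code below guards against that case.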


Performance tests

Three alternative functions were defined:

  • incremental: My answer.
  • from_origin: Marco's original answer.
  • incremental_pandas: Marco's updated answer.

Tests were done with the timeit module, 3 repetitions each, on random samples with a 0.4 probability of NaN.

[Plot: average runtime vs. array size for the three functions]

Full code for testing

import pandas as pd
import numpy as np
import timeit
import collections
from matplotlib import pyplot as plt


def incremental(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    
    cumsum = 0
    expanding_mean = []
    for i, xi in enumerate(df['A']):
        if pd.isna(xi):
            mean = cumsum / i  # i == number of items up to the previous row
            expanding_mean.append(mean)
            cumsum += mean
        else:
            cumsum += xi
    df.loc[df['A'].isna(), 'A'] = expanding_mean
    return df

def incremental_pandas(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0

    last_idx = None
    cumsum = 0
    cumnum = 0
    for idx in df[pd.isna(df["A"])].index:
        prev_values = df.loc[ last_idx : idx, "A" ]
        # .loc slicing is label-based and inclusive of the end label, so we drop idx
        prev_values = prev_values[ : -1 ]
        cumsum += prev_values.sum()
        cumnum += len(prev_values) 
        df.loc[idx, "A"] = cumsum / cumnum
        last_idx = idx
        
    return df

def from_origin(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0

    for idx in df[pd.isna(df["A"])].index:
        df.loc[idx, "A"] = np.mean(df.loc[ : idx, "A" ])

    return df


def get_random_sample(n, p):
    np.random.seed(123)
    return pd.DataFrame({'A': 
        np.random.choice(list(range(10)) + [np.nan],
            size=n, p=[(1 - p) / 10] * 10 + [p])})


r = 3
p = 0.4 # portion of NaNs

# check result from all functions
results = []
for func in [from_origin, incremental, incremental_pandas]:
    random_df = get_random_sample(1000, p)
    new_df = random_df.copy(deep=True)
    results.append(func(new_df))

print('Passed' if all(np.allclose(r, results[0]) for r in results[1:])
    else 'Failed', 'implementation test')



timings = {}
for n in np.geomspace(10, 10000, 10):
    random_df = get_random_sample(int(n), p)
    timings[n] = collections.defaultdict(float)
    results = {}
    for func in ['incremental', 'from_origin', 'incremental_pandas']:
        timings[n][func] = (
            timeit.timeit(f'{func}(random_df.copy(deep=True))', number=r, globals=globals())
            / r
        )

timings = pd.DataFrame(timings).T
print(timings)

timings.plot()
plt.xlabel('size of array')
plt.ylabel('avg runtime (s)')
plt.ylim(0)
plt.grid(True)
plt.tight_layout()
plt.show()
plt.close('all')

Upvotes: 1

Marco

Reputation: 887

You can solve it with the following code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
  "A": [ 1, 2, 3, 4, 5, pd.NA, pd.NA, 11, pd.NA ]
})

for idx in df[pd.isna(df["A"])].index:
    df.loc[idx, "A"] = np.mean(df.loc[ : idx, "A" ])

It iterates over each NaN and fills it with the mean of the previous values, including previously filled NaNs.

At the end you will have:

>>> df
    A
0   1
1   2
2   3
3   4
4   5
5   3
6   3
7  11
8   4

EDIT

As RichieV stated, performance may be an issue with this solution when there are many NaNs (its runtime complexity is O(N^2)). We should also avoid Python-level iteration, since it is slow compared to native pandas/numpy calls.

Here is an optimized version:

last_idx = None
cumsum = 0
cumnum = 0

for idx in df[pd.isna(df["A"])].index:
    prev_values = df.loc[ last_idx : idx, "A" ]
    # .loc slicing is label-based and inclusive of the end label, so we drop idx
    prev_values = prev_values[ : -1 ]
    cumsum += prev_values.sum()
    cumnum += len(prev_values) 
    df.loc[idx, "A"] = cumsum / cumnum  # no int() cast: the mean need not be a whole number
    last_idx = idx

Result:

>>> df
    A
0   1
1   2
2   3
3   4
4   5
5   3
6   3
7  11
8   4

Since the script passes over the dataframe at most twice, the runtime complexity is now O(N).
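A self-contained version of the optimized loop (rebuilding the sample frame as a float column, which is my assumption, and keeping the means as floats) reproduces the expected result:

```python
import pandas as pd

# sample data from the question, as a float column so NaN is representable
df = pd.DataFrame({"A": [1, 2, 3, 4, 5, None, None, 11, None]}, dtype=float)

last_idx = None
cumsum = 0
cumnum = 0
for idx in df[pd.isna(df["A"])].index:
    # .loc slicing is inclusive of the end label, so drop idx itself
    prev_values = df.loc[last_idx:idx, "A"][:-1]
    cumsum += prev_values.sum()
    cumnum += len(prev_values)
    df.loc[idx, "A"] = cumsum / cumnum
    last_idx = idx

print(df["A"].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 3.0, 11.0, 4.0]
```

Each iteration only sums the values between the previous NaN and the current one, which is how the two running totals keep the work linear overall.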

Upvotes: 1
