Jun Seong Jang

Reputation: 395

Dynamically replace NaN values with the average of previous and next non-missing values

I have a dataframe df with NaN values and I want to dynamically replace them with the average values of previous and next non-missing values.

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.

Therefore, the result dataframe should look like:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325  1.533582
4 -0.621108 -1.319834  0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858  1.033826 -2.551718
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431
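
The rule described above (each NaN becomes the mean of the value just before it, original or already filled, and the next original non-missing value) can be sketched as a small per-column helper. This is only a minimal sketch: the name `fill_with_running_midpoint` is hypothetical, and leaving leading or trailing NaNs untouched is an assumption, since the question does not say how they should be treated.

```python
import numpy as np
import pandas as pd


def fill_with_running_midpoint(series: pd.Series) -> pd.Series:
    """Fill each NaN with the mean of the previous value (original or already
    filled) and the next original non-NaN value. NaNs without a neighbour on
    one side (leading or trailing gaps) are left as NaN."""
    values = series.to_numpy(dtype=float, copy=True)
    # next_valid[i] is the nearest original non-NaN value at or after position i
    next_valid = series.bfill().to_numpy(dtype=float)
    for i in range(1, len(values)):
        has_prev = not np.isnan(values[i - 1])
        has_next = not np.isnan(next_valid[i])
        if np.isnan(values[i]) and has_prev and has_next:
            values[i] = (values[i - 1] + next_valid[i]) / 2
    return pd.Series(values, index=series.index, name=series.name)


# Rows 2 to 5 of column A from the question
a = pd.Series([-0.120211, np.nan, np.nan, -0.788073])
filled = fill_with_running_midpoint(a)
```

For a whole dataframe, `df.apply(fill_with_running_midpoint)` would run the helper once per column.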

Is this a good way to deal with the missing values? I can't simply replace them by the average values of each column because my data is time-series and tends to increase over time. (The initial value may be $0 and final value might be $100000, so the average is $50000 which can be much bigger/smaller than the NaN values).

Upvotes: 6

Views: 5500

Answers (5)

franco martinez

Reputation: 1

I had the same problem! My solution was the following:

def prev_next_notNAN(serie, index):
    # Receives a pandas Series and the position of a NaN in it.
    # Returns the index labels of the previous and next non-NaN values.
    prev_notNAN_index = serie.iloc[:index].dropna().index[-1]
    next_notNAN_index = serie.iloc[index:].dropna().index[0]
    return prev_notNAN_index, next_notNAN_index

def fill_nan_with_mean_from_prev_and_next(df, NAN_column: str):
    # Receives a pandas DataFrame and the name of the column to fill.
    # Only rows where this column is NaN are touched (assumes a default RangeIndex).
    NANrows = df[NAN_column].isna().to_numpy().nonzero()[0]
    for row in NANrows:
        prev_index, next_index = prev_next_notNAN(df[NAN_column], row)
        df.at[row, NAN_column] = (df.at[prev_index, NAN_column] + df.at[next_index, NAN_column]) / 2
    return df

These two functions fill each NaN in one column with the simple mean of its previous and next non-NaN values. Here is an example:

In:
import numpy as np
import pandas as pd

values = np.array([
 [-0.166919,  0.979728, -0.632955],
 [-0.297953, -0.912674, -1.365463],
 [-0.120211, -0.540679, -0.680481],
 [   np.nan, -2.027325,  1.533582],
 [   np.nan,    np.nan,  0.461821],
 [-0.788073,    np.nan,    np.nan],
 [-0.916080, -0.612343,    np.nan],
 [-0.887858,  1.033826,    np.nan],
 [ 1.948430,  1.025011, -2.982224],
 [ 0.019698, -0.795876, -0.046431]])
df = pd.DataFrame(values, columns = ["A","B","C"])

colA_clean = fill_nan_with_mean_from_prev_and_next(df, "A")
Out:
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325  1.533582
4 -0.621108       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

Upvotes: 0

RAGUL G

Reputation: 1

import pandas as pd
import numpy as np

data = {
    'A': [-0.166919, -0.297953, -0.120211, np.nan, np.nan, -0.788073, -0.916080, -0.887858, 1.948430, 0.019698],
    'B': [0.979728, -0.912674, -0.540679, -2.027325, np.nan, np.nan, -0.612343, 1.033826, 1.025011, -0.795876],
    'C': [-0.632955, -1.365463, -0.680481, 1.533582, 0.461821, np.nan, np.nan, np.nan, -2.982224, -0.046431]
}
df = pd.DataFrame(data)

def replace_nan_with_neighbors_avg(df):
    # Iterate over columns
    for col in df.columns:
        # Get indices of NaN values
        nan_indices = df[df[col].isna()].index
        # Iterate over NaN indices
        for i in nan_indices:
            # Find previous non-NaN value
            prev_val = df[col][:i].dropna().iloc[-1] if not df[col][:i].dropna().empty else np.nan
            # Find next non-NaN value
            next_val = df[col][i+1:].dropna().iloc[0] if not df[col][i+1:].dropna().empty else np.nan
            # Calculate average and replace NaN
            if not np.isnan(prev_val) and not np.isnan(next_val):
                df.at[i, col] = (prev_val + next_val) / 2
            elif not np.isnan(prev_val):  
                df.at[i, col] = prev_val
            elif not np.isnan(next_val):  
                df.at[i, col] = next_val
    return df
df_filled = replace_nan_with_neighbors_avg(df)
print(df_filled)

Upvotes: 0

samim-sh

Reputation: 1

As Ben.T mentioned above, a column may contain more than one group of NaNs. If so, you can consider this lazy solution :)

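for column in df:
    previous = np.nan  # last non-NaN (or already filled) value seen in this column
    for ind, value in df[column].items():
        if not np.isnan(value):
            previous = value
        else:
            # scan forward for the next non-NaN value (assumes a default integer index)
            indx = ind + 1
            while indx in df.index and np.isnan(df.loc[indx, column]):
                indx += 1
            if indx not in df.index or np.isnan(previous):
                continue  # leading or trailing gap: no neighbour on one side, leave the NaN
            next_val = df.loc[indx, column]
            # write with .loc to avoid chained assignment; the filled value becomes "previous"
            previous = df.loc[ind, column] = (previous + next_val) / 2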

Upvotes: 0

Björn P

Reputation: 1

I had a similar problem. The following code worked for me.

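def fill_nan_with_mean_from_prev_and_next(df):
    # positions of the rows that contain at least one NaN
    nan_rows = df.isnull().any(axis=1).to_numpy().nonzero()[0]
    null_df = df.isnull()
    for row in nan_rows:
        for col in range(df.shape[1]):
            if null_df.iloc[row, col]:
                # next non-NaN value below this row (the original averaged row-1 with itself)
                later = df.iloc[row + 1:, col].dropna()
                if row == 0 or later.empty:
                    continue  # no neighbour on one side, leave the NaN
                # single .iloc[row, col] write avoids chained assignment
                df.iloc[row, col] = (df.iloc[row - 1, col] + later.iloc[0]) / 2

    return df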

Maybe it helps someone too.

Upvotes: 0

BENY

Reputation: 323226

Try to understand the logic behind your average: it is a geometric progression, so the fill values can be computed in closed form.

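import pandas as pd
s = df.isnull().cumsum()  # n for the n-th NaN; assumes one interior NaN run per column
# x: last valid value before the NaN run in each column
t1 = pd.Series({c: df.at[r, c] for c, r in ((s == 1).idxmax() - 1).items()})
# y: first valid value after the NaN run (df.lookup was removed in pandas 2.0)
t2 = pd.Series({c: df.at[r, c] for c, r in (s.idxmax() + 1).items()})
df.fillna(t1 / 2**s + t2 * (1 - 0.5**s))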
Out[212]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325  1.533582
4 -0.621107 -1.319834  0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858  1.033826 -2.551718
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

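Explanation (x is the last value before the gap, y is the first value after it):

1st NaN: x/2 + y/2 = 1st

2nd NaN: 1st/2 + y/2 = 2nd

3rd NaN: 2nd/2 + y/2 = 3rd

So the n-th NaN is x/(2**n) + (y/2)*(1-(1/2)**n)/(1-1/2) = x/(2**n) + y*(1-(1/2)**n). This is the key.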
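
As a quick sanity check, the closed form can be compared with the step-by-step averaging. The sketch below assumes x is the last value before the gap, y the first value after it, and n the position of the NaN within the gap; the function names are made up for illustration.

```python
def iterative_fill(x: float, y: float, n: int) -> float:
    """Apply value = (value + y) / 2 a total of n times, starting from x."""
    value = x
    for _ in range(n):
        value = (value + y) / 2
    return value


def closed_form_fill(x: float, y: float, n: int) -> float:
    """Geometric-progression form: x / 2**n + y * (1 - (1/2)**n)."""
    return x / 2**n + y * (1 - 0.5**n)


# Column C from the question: x = C[4], y = C[8], gap of three NaNs
x, y = 0.461821, -2.982224
for n in (1, 2, 3):
    assert abs(iterative_fill(x, y, n) - closed_form_fill(x, y, n)) < 1e-12
```

Both versions agree for every n in the gap; for n = 3 they give about -2.551718, the value shown for C[7].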

Upvotes: 2
