Reputation: 395
I have a dataframe df with NaN values and I want to dynamically replace them with the average of the previous and next non-missing values.
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
For example, A[3] is NaN, so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] should then be (-0.454142-0.788073)/2 = -0.621108.
Therefore, the result dataframe should look like:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621108 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Is this a good way to deal with the missing values? I can't simply replace them with the average value of each column because my data is a time series and tends to increase over time. (The initial value may be $0 and the final value might be $100000, so the average is $50000, which can be much bigger or smaller than the values around the NaN entries.)
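To make the concern about column means concrete, here is a minimal sketch on a made-up rising series (the values and the names `s`, `mean_filled` and `rule_filled` are illustrative assumptions, not data from my real dataframe). It compares a column-mean fill with the rule described above, where each NaN becomes the average of the previous (possibly already filled) value and the next valid value.

```python
import numpy as np
import pandas as pd

# Hypothetical rising series with a two-element gap (values are made up)
s = pd.Series([0.0, 1.0, np.nan, np.nan, 4.0, 100.0])

# Column-mean fill: every gap gets the same value, regardless of its neighbours
mean_filled = s.fillna(s.mean())

# Rule from the question: average of the previous (possibly already filled)
# value and the next valid value
next_valid = s.bfill()  # first valid value at or after each position
rule_filled = s.copy()
for i in range(1, len(rule_filled)):
    if np.isnan(rule_filled.iloc[i]):
        rule_filled.iloc[i] = (rule_filled.iloc[i - 1] + next_valid.iloc[i]) / 2

print(pd.DataFrame({"original": s, "mean": mean_filled, "rule": rule_filled}))
```

In this toy case the column mean is pulled up by the last value, while the rule keeps the filled values between their neighbours. A leading or trailing NaN has no neighbour on one side and stays NaN in this sketch.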
Upvotes: 6
Views: 5500
Reputation: 1
I had the same problem! My solution was the following:
import numpy as np
import pandas as pd

def prev_next_notNAN(serie, index):
    # Receives a pandas Series and the position of a NaN value.
    # Returns the indices of the previous and next non-NaN values.
    prev_notNAN_index = serie.iloc[:index].dropna().index[-1]
    next_notNAN_index = serie.iloc[index:].dropna().index[0]
    return prev_notNAN_index, next_notNAN_index

def fill_nan_with_mean_from_prev_and_next(df, NAN_column: str):
    # Receives a pandas DataFrame and the name of the column you want to fill.
    # Only rows where this column is NaN are touched (assumes the default RangeIndex).
    NANrows = df[NAN_column].isna().to_numpy().nonzero()[0]
    for row in NANrows:
        prev_index, next_index = prev_next_notNAN(df[NAN_column], row)
        df.at[row, NAN_column] = (df.at[prev_index, NAN_column] + df.at[next_index, NAN_column]) / 2
    return df

These two functions let you fill a NaN value with the simple mean of its previous and next non-NaN values. Here is an example:

values = np.array([
    [-0.166919, 0.979728, -0.632955],
    [-0.297953, -0.912674, -1.365463],
    [-0.120211, -0.540679, -0.680481],
    [np.nan, -2.027325, 1.533582],
    [np.nan, np.nan, 0.461821],
    [-0.788073, np.nan, np.nan],
    [-0.916080, -0.612343, np.nan],
    [-0.887858, 1.033826, np.nan],
    [1.948430, 1.025011, -2.982224],
    [0.019698, -0.795876, -0.046431]])
df = pd.DataFrame(values, columns=["A", "B", "C"])
colA_clean = fill_nan_with_mean_from_prev_and_next(df, "A")
print(colA_clean)

Out:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621108 NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Upvotes: 0
Reputation: 1
import pandas as pd
import numpy as np

data = {
    'A': [-0.166919, -0.297953, -0.120211, np.nan, np.nan, -0.788073, -0.916080, -0.887858, 1.948430, 0.019698],
    'B': [0.979728, -0.912674, -0.540679, -2.027325, np.nan, np.nan, -0.612343, 1.033826, 1.025011, -0.795876],
    'C': [-0.632955, -1.365463, -0.680481, 1.533582, 0.461821, np.nan, np.nan, np.nan, -2.982224, -0.046431]
}
df = pd.DataFrame(data)

def replace_nan_with_neighbors_avg(df):
    # Iterate over columns
    for col in df.columns:
        # Get indices of NaN values
        nan_indices = df[df[col].isna()].index
        # Iterate over NaN indices
        for i in nan_indices:
            # Find previous non-NaN value
            prev_val = df[col][:i].dropna().iloc[-1] if not df[col][:i].dropna().empty else np.nan
            # Find next non-NaN value
            next_val = df[col][i+1:].dropna().iloc[0] if not df[col][i+1:].dropna().empty else np.nan
            # Calculate average and replace NaN
            if not np.isnan(prev_val) and not np.isnan(next_val):
                df.at[i, col] = (prev_val + next_val) / 2
            elif not np.isnan(prev_val):
                df.at[i, col] = prev_val
            elif not np.isnan(next_val):
                df.at[i, col] = next_val
    return df

df_filled = replace_nan_with_neighbors_avg(df)
print(df_filled)
Upvotes: 0
Reputation: 1
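As Ben.T has mentioned above, if you have another group of NaN in the same column, you can consider this lazy solution :)

# Assumes no column starts or ends with NaN
for column in df:
    for ind, row in df[[column]].iterrows():
        if not np.isnan(row[column]):
            previous = row[column]
        else:
            # Scan forward to the next non-NaN value in this column
            indx = ind + 1
            while np.isnan(df.loc[indx, column]):
                indx += 1
            next_val = df.loc[indx, column]
            # Fill the gap and carry the filled value forward as the new "previous"
            previous = (previous + next_val) / 2
            df.loc[ind, column] = previous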
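For columns with several separate groups of NaN, a vectorized variant is also possible. The sketch below is only one way to do it: it assumes that repeatedly averaging with the next valid value y, starting from the last valid value x, gives x/2**n + y*(1 - 0.5**n) for the n-th NaN in a gap. The function name `fill_gaps_closed_form` and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

def fill_gaps_closed_form(df):
    """Fill each NaN with x/2**n + y*(1 - 0.5**n), where x is the last valid
    value before the gap, y the first valid value after it, and n the
    1-based position of the NaN inside its gap."""
    isna = df.isna()
    # n: running count of NaNs since the last valid value (0 on valid cells)
    n = isna.apply(lambda col: col.astype(int).groupby((~col).cumsum()).cumsum())
    x = df.ffill()  # last valid value at or before each cell
    y = df.bfill()  # first valid value at or after each cell
    # Leading or trailing gaps have no x or no y and therefore stay NaN
    return x / 2.0**n + y * (1 - 0.5**n)

# Made-up data with two separate gaps in one column
df = pd.DataFrame({"A": [0.0, np.nan, 8.0, np.nan, np.nan, 16.0]})
filled = fill_gaps_closed_form(df)
print(filled)
```

On valid cells n is 0, so the expression reduces to the original value and only the gaps change.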
Upvotes: 0
Reputation: 1
I had a similar problem. The following code worked for me.

def fill_nan_with_mean_from_prev_and_next(df):
    # Positions of the rows that contain at least one NaN
    NANrows = pd.isnull(df).any(axis=1).to_numpy().nonzero()[0]
    null_df = df.isnull()
    for row in NANrows:
        for colum in range(0, df.shape[1]):
            if null_df.iloc[row, colum]:
                # Mean of the rows directly above and below
                # (stays NaN if the next row is also NaN)
                df.iloc[row, colum] = (df.iloc[row-1, colum] + df.iloc[row+1, colum]) / 2
    return df

Maybe it helps someone too.
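Assuming the intent of the function above is to average the rows directly before and after each NaN, the same idea can be sketched without explicit loops using `shift`. The sample data below is made up, and runs of consecutive NaN values are left as NaN by this approach.

```python
import numpy as np
import pandas as pd

# Made-up data with isolated NaN values
df = pd.DataFrame({"A": [1.0, np.nan, 3.0, 4.0],
                   "B": [10.0, 20.0, np.nan, 40.0]})

# Mean of the row directly above and the row directly below, for every cell
neighbour_mean = (df.shift(1) + df.shift(-1)) / 2

# Only cells that are NaN in df are taken from neighbour_mean
filled = df.fillna(neighbour_mean)
print(filled)
```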
Upvotes: 0
Reputation: 323226
Try to understand the logic behind your average: it is a geometric progression.
s=df.isnull().cumsum()
t1=df[(s==1).shift(-1).fillna(False)].stack().reset_index(level=0,drop=True)
t2=df.lookup(s.idxmax()+1,s.idxmax().index)
df.fillna(t1/(2**s)+t2*(1-0.5**s)*2/2)
Out[212]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621107 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
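Explanation (x is the last value before the gap, y is the first value after it):
1st NaN: x/2 + y/2 = 1st
2nd NaN: 1st/2 + y/2 = 2nd
3rd NaN: 2nd/2 + y/2 = 3rd
Then the n-th NaN is x/(2**n) + (y/2)*(1-(1/2)**n)/(1-1/2) = x/(2**n) + y*(1-(1/2)**n); this is the key.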
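As a quick numerical check of the progression, here is a small sketch that compares the step-by-step averaging with the closed form x/(2**n) + y*(1-(1/2)**n). The function names are made up for illustration; x and y are taken from column C of the question (rows 4 and 8).

```python
def fill_recursive(x, y, n):
    """Average with the next valid value y, n times, starting from x."""
    value = x
    for _ in range(n):
        value = (value + y) / 2
    return value

def fill_closed_form(x, y, n):
    """Closed form of the same recursion: x/2**n + y*(1 - (1/2)**n)."""
    return x / 2**n + y * (1 - 0.5**n)

# Column C from the question: x = C[4], y = C[8]
x, y = 0.461821, -2.982224
for n in (1, 2, 3):
    print(n, fill_recursive(x, y, n), fill_closed_form(x, y, n))
```

Both columns of the printout should agree up to floating-point error, and they match rows 5 to 7 of column C in the output above to about six decimal places.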
Upvotes: 2