Fill NaN values in a column within a specific range of values

Question

I am wanting to do the following:

Fill NaN values in a single column using values within a specific range.

The range I am wanting to use is the mean of the non-Nan values in the column +/- 1 one standard deviation of the computed mean.

NOTE If possible, I would like to be able to use multiples of the std dev by simply multiplying it by a constant.

I thought I had it (see full code below) but the output from print(df['C'].describe()) shows that I am generating values well outside my desired range. In fact, I am generating numbers outside the original min and max of the column, which is definitely not what I want.

import pandas as pd
import numpy as np
import sys

print('Python: {}'.format(sys.version))
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('\033[1;31m' + '--------------' + '\033[0m')  # Bold red

display_settings = {
    'max_columns': 15,
    'max_colwidth': 60,
    'expand_frame_repr': False,  # Wrap to multiple pages
    'max_rows': 50,
    'precision': 6,
    'show_dimensions': False
}
# pd.options.display.float_format = '{:,.2f}'.format

for op, value in display_settings.items():
    pd.set_option("display.{}".format(op), value)

df = pd.DataFrame(np.random.randint(0, 1000, size=(200, 10)), columns=list('ABCDEFGHIJ'))
# df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list(['AA','BB','C2','D2']))
print(df, '
')

# https://stackoverflow.com/questions/55149738/pandas-replace-values-with-nan-at-random
df['C'] = df['C'].sample(frac=0.65)  # The percentage of non-NaN values.
df['H'] = df['H'].sample(frac=0.75)  # The percentage of non-NaN values.
print(df, '
')
print(df.isnull().sum(), '
')
print(df['C'].describe(), '
')

def fillNaN_with_unifrand(col):
    a = col.values
    m = np.isnan(a)  # mask of NaNs
    mu, sigma = col.mean(), col.std()
    a[m] = np.random.normal(mu, sigma, size=m.sum())
    return col


# https://stackoverflow.com/questions/46543060/how-to-replace-every-nan-in-a-column-with-different-random-values-using-pandas?rq=1
fillNaN_with_unifrand(df['C'])
pd.options.display.float_format = '{:.0f}'.format
print(df, '
')
print(df.isnull().sum(), '
')
print(df['C'].describe())

Output of print(df['C'].describe()):

Starting:

count    130.000000
mean     462.446154
std      290.760432
min        7.000000
25%      187.500000
50%      433.000000
75%      671.250000
max      992.000000
Name: C, dtype: float64

Ending:

count    200
mean     517
std      298
min     -187
25%      281
50%      544
75%      763
max     1218
Name: C, dtype: float64

Note the min and max. All of my fill values (in this instance) should have been 462 +/- 290.

offeltoffel · Accepted Answer

Well, this is not how statistics work. A Gaussian Normal Distribution has a mean and a std but values can be drawn far away from mean +- std, they are just less likeley. As per definition of a normal distribution, 68 % of all values are within +- 1*std, 95 % are within +-2*std and so on. The question is: What do you want to do with outliers? Set them to mean +- std or draw again?

Case 1: Set outliers to min/max

This is usually unwanted, as this changes your distribution and puts more weight on the lower and upper boundary.

from matplotlib import pyplot as plt

mu = 100
sigma = 7
a = np.random.normal(mu, sigma, size=2000) # I used a size of 2000 as an example
a[a<(mu-sigma)] = mu-sigma
a[a>(mu+sigma)] = mu+sigma
plt.hist(a, bins=12, edgecolor='black')
plt.show()

Case 2: Truncated Normal Distribution

What you usually want is the Truncated Normal Distribution. It creates a distribution with an upper and a lower boundary. You find this function at the scipy.stats module. It works a bit different though: you first create the distribution by normalizing the lower and upper clip and then you create a numer of random variates rvs from it like this:

from matplotlib import pyplot as plt
import scipy.stats as stats

mu = 100
sigma = 7
lower_clip = mu-sigma
upper_clip = mu+sigma
a = stats.truncnorm((lower_clip - mu) / sigma, (upper_clip - mu) / sigma, loc=mu, scale=sigma)
plt.hist(a.rvs(2000), bins=12, edgecolor='black')
plt.show()

The constant of multiples of sigma is easily implemented. You can just change your lower and upper clip like

lower_clip = mu-x*sigma

with x being your constant.

Fill NaN values in a column within a specific range of values

Answers (1)

Case 1: Set outliers to min/max

Case 2: Truncated Normal Distribution

Related Questions