Reputation: 75
I currently have a large dataset with quite a few missing values.
I'm trying to fill in these missing values by building a probability distribution from the data I have and sampling from it. E.g., build the distribution, then randomly choose a number from 0 to 1 and fill in the missing data with the corresponding value.
I've read documentation for scipy and numpy. I think I'm looking for a continuous version of random.choice.
Company | Weight |
---|---|
a | 30 |
a | 45 |
a | 27 |
a | na |
a | 57 |
a | 57 |
a | na |
I'm trying to fill the NA values by creating a continuous distribution using the data I already have.
I've tried using np.random.choice so far, i.e.: np.random.choice([30, 45, 27, 57], p=[0.2, 0.2, 0.2, 0.4])
However, this only returns the specific values I pass in. I am trying to create a continuous model so that I can get any number between 27 and 57, with probability based on how often a value appears in my existing data.
So in this case, numbers closer to 57 will be more likely to be chosen as it appears more frequently in my previous data.
Upvotes: 2
Views: 979
Reputation: 11161
Kernel density estimation (KDE) is a common method to generate continuous distributions from sample data, but it generally requires tuning some parameters. Other methods include mean/mode imputation (basic) and model-based prediction (more sophisticated).
We fit a kernel density estimator below and then generate random samples from the density with kde.sample to fill the nan values:
import pandas as pd
import numpy as np
from numpy import nan
from sklearn.neighbors import KernelDensity
BANDWIDTH = 1
KERNEL = "gaussian"
data = {'company': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A'},
'weight': {0: 30.0, 1: 45.0, 2: 27.0, 3: nan, 4: 57.0, 5: 57.0, 6: nan}}
df = pd.DataFrame.from_dict(data)
kde = KernelDensity(kernel=KERNEL, bandwidth=BANDWIDTH).fit(df[["weight"]].dropna().values)
# replace nan with sampled values from kde
n_missing = df.weight.isna().sum()
df.loc[df.weight.isna(), "weight"] = kde.sample(n_missing)
output:
company weight
0 A 30.000000
1 A 45.000000
2 A 27.000000
3 A 56.542771
4 A 57.000000
5 A 57.000000
6 A 38.970918
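The bandwidth of 1 used above is one of the parameters that usually needs tuning. One possible way to choose it is a cross-validated grid search over candidate bandwidths, scored by held-out log-likelihood. The following is only a sketch: the candidate grid (1 to 20) and the names observed, search and tuned_kde are assumptions for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KernelDensity

# observed (non-missing) weights from the example, as a column vector
observed = np.array([30.0, 45.0, 27.0, 57.0, 57.0]).reshape(-1, 1)

# candidate bandwidths are an assumption; adjust the grid for real data
param_grid = {"bandwidth": np.linspace(1.0, 20.0, 20)}

# KernelDensity.score returns the total log-likelihood of the held-out data,
# so GridSearchCV picks the bandwidth that maximizes it
search = GridSearchCV(KernelDensity(kernel="gaussian"), param_grid, cv=LeaveOneOut())
search.fit(observed)

best_bandwidth = search.best_params_["bandwidth"]
tuned_kde = search.best_estimator_  # refit on all observed values

# draw two replacement values from the tuned density
samples = tuned_kde.sample(2, random_state=0).ravel()
print(best_bandwidth, samples)
```

With only five observations the selected bandwidth will be noisy, so treat it as a starting point rather than a definitive choice.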
sample data and density plots:
import plotly.express as px
# histogram
px.histogram(df.weight, nbins=40).show()
# density line plot
x_vals = np.linspace(df.weight.min(), df.weight.max(), 1000)
density = np.exp(kde.score_samples(x_vals.reshape(-1,1)))
px.line(x=x_vals, y=density).show()
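Note that a Gaussian kernel has unbounded support, so kde.sample can occasionally return values below 27 or above 57. If the filled values must stay within the observed range, as the question asks, one option is simple rejection sampling. This is a rough sketch: sample_within_range is a hypothetical helper written for illustration, not a scikit-learn function, and rejecting samples slightly distorts the density near the boundaries.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def sample_within_range(kde, n, low, high, seed=0, max_rounds=1000):
    """Hypothetical helper: draw n values from kde, discarding any outside [low, high]."""
    rng = np.random.RandomState(seed)
    kept = np.empty(0)
    for _ in range(max_rounds):
        if kept.size >= n:
            break
        # oversample, then keep only the draws that fall inside the range
        draws = kde.sample(max(2 * n, 16), random_state=rng).ravel()
        kept = np.concatenate([kept, draws[(draws >= low) & (draws <= high)]])
    if kept.size < n:
        raise RuntimeError("could not draw enough in-range samples")
    return kept[:n]

observed = np.array([30.0, 45.0, 27.0, 57.0, 57.0])
kde_fit = KernelDensity(kernel="gaussian", bandwidth=1).fit(observed.reshape(-1, 1))

# two replacement values, guaranteed to lie between the observed min and max
filled = sample_within_range(kde_fit, n=2, low=observed.min(), high=observed.max())
print(filled)
```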
Upvotes: 1