ChrisHo1341

Reputation: 75

Create continuous distribution and sample from it

I currently have a large dataset with quite a few missing values.

I'm trying to fill in these missing values by building a distribution from the data I have and sampling from it. For example: build the distribution, randomly choose a number from 0 to 1, and fill in the missing entry with the corresponding value.

I've read documentation for scipy and numpy. I think I'm looking for a continuous version of random.choice.

Company Weight
a 30
a 45
a 27
a na
a 57
a 57
a na

I'm trying to fill the NA values by creating a continuous distribution from the data I already have.

I've tried using np.random.choice so far, i.e.: np.random.choice([30, 45, 27, 57], p=[0.2, 0.2, 0.2, 0.4])

However, this only returns the specific values I pass in. I am trying to create a continuous model so that I can get any number between 27 and 57, with a probability based on how often each value appears in my existing data.

So in this case, numbers closer to 57 would be more likely to be chosen, since 57 appears more frequently in my existing data.

Upvotes: 2

Views: 979

Answers (1)

anon01

Reputation: 11161

Kernel density estimation (KDE) is a common method for building a continuous distribution from sample data, but it generally requires tuning some parameters, such as the kernel and the bandwidth. Other imputation methods include mean/mode imputation (basic) and model-based prediction (more sophisticated).

Below, we fit a kernel density estimator and then draw random samples from the density with kde.sample to fill the nan values:

import pandas as pd
import numpy as np
from numpy import nan
from sklearn.neighbors import KernelDensity

BANDWIDTH = 1
KERNEL = "gaussian"

data = {'company': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A'},
'weight': {0: 30.0, 1: 45.0, 2: 27.0, 3: nan, 4: 57.0, 5: 57.0, 6: nan}}
df = pd.DataFrame.from_dict(data)

kde = KernelDensity(kernel=KERNEL, bandwidth=BANDWIDTH).fit(df[["weight"]].dropna().values)

# replace nan with sampled values from kde
# kde.sample returns shape (n_missing, 1), so flatten it before assigning
n_missing = int(df.weight.isna().sum())
df.loc[df.weight.isna(), "weight"] = kde.sample(n_missing).ravel()

output:

  company     weight
0       A  30.000000
1       A  45.000000
2       A  27.000000
3       A  56.542771
4       A  57.000000
5       A  57.000000
6       A  38.970918
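
BANDWIDTH = 1 above is a hand-picked value. One possible way to choose it from the data is to cross-validate the log-likelihood, sketched below. The candidate grid (np.logspace(-1, 1.5, 20)) and cv=5 are assumptions made for this tiny example, not recommendations; with only five observed points the result will be noisy.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Observed (non-missing) weights from the example, as a column vector.
observed = np.array([30.0, 45.0, 27.0, 57.0, 57.0]).reshape(-1, 1)

# Candidate bandwidths are an assumption; adjust the range to your data's scale.
param_grid = {"bandwidth": np.logspace(-1, 1.5, 20)}

# KernelDensity.score returns the total log-likelihood of the data it is given,
# so the search keeps the bandwidth that best explains the held-out points.
search = GridSearchCV(KernelDensity(kernel="gaussian"), param_grid, cv=5)
search.fit(observed)

best_bandwidth = search.best_params_["bandwidth"]
kde = search.best_estimator_  # refit on all observed points

# Draw two replacement values, flattened to 1-D for assignment into a column.
samples = kde.sample(2, random_state=0).ravel()
print(best_bandwidth, samples)
```

The fitted kde can then be used in place of the one built with the fixed BANDWIDTH.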

sample data and density plots:

import plotly.express as px

# histogram
px.histogram(df.weight, nbins=40).show()

# density line plot
x_vals = np.linspace(df.weight.min(), df.weight.max(), 1000)
density = np.exp(kde.score_samples(x_vals.reshape(-1,1)))
px.line(x=x_vals, y=density).show()
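
Since the question mentions scipy, here is a rough sketch of the same idea with scipy.stats.gaussian_kde, which picks a bandwidth automatically (Scott's rule by default). The clipping step is an assumption added to keep samples between 27 and 57 as the question asks; it is crude, because it piles probability mass onto the endpoints, so drop it if values slightly outside the observed range are acceptable.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

df = pd.DataFrame({
    "company": ["A"] * 7,
    "weight": [30.0, 45.0, 27.0, np.nan, 57.0, 57.0, np.nan],
})

observed = df["weight"].dropna().to_numpy()
kde = gaussian_kde(observed)  # bandwidth chosen by Scott's rule by default

missing = df["weight"].isna()

# resample returns an array of shape (1, n) for 1-D data; take the first row.
draws = kde.resample(int(missing.sum()), seed=0)[0]

# Assumption: restrict samples to the observed range (crude; see note above).
draws = np.clip(draws, observed.min(), observed.max())

df.loc[missing, "weight"] = draws
print(df)
```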


Upvotes: 1
