Jake Wagner

Reputation: 826

Fill NaN values in proportion to the existing value ratios

Suppose we have a DataFrame with two columns: the boroughs of NYC and the incidents that occurred in those boroughs.

df['BOROUGH'].value_counts() 

BROOKLYN          368129
QUEENS            315681
MANHATTAN         278583
BRONX             167083
STATEN ISLAND      50194

518,953 rows have null under BOROUGH.

df.shape

(1698623, 2)

How can I fill the null values so that they are allocated in proportion to the existing borough ratios?

For example:

df['BOROUGH'].value_counts()/df['BOROUGH'].value_counts().sum()

BROOKLYN         0.312061
QUEENS           0.267601
MANHATTAN        0.236153
BRONX            0.141635
STATEN ISLAND    0.042549

31% of the nulls (518,953) should become BROOKLYN = 160,875

27% of the nulls (518,953) should become QUEENS = 140,117, and so forth.

Requested result after filling the nulls proportionally:

df['BOROUGH'].value_counts()

BROOKLYN          529004
QUEENS            455798
.......

Upvotes: 0

Views: 173

Answers (1)

Quang Hoang

Reputation: 150745

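You can use np.random.choice to fill the nulls by sampling from the observed boroughs, weighted by their frequencies. Because the sampling is random, the filled counts match the ratios on average rather than exactly: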

import numpy as np

# boolean mask marking where the null values are
is_null = df['BOROUGH'].isna()

# obtain the distribution of non-null values (fractions summing to 1)
freq = df['BOROUGH'].value_counts(normalize=True)

# random sampling with corresponding frequencies, one draw per null row
to_replace = np.random.choice(freq.index, p=freq.to_numpy(), size=is_null.sum())

# write the sampled boroughs into the null rows only
df.loc[is_null, 'BOROUGH'] = to_replace
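
If the filled counts need to follow the ratios as closely as whole numbers allow, rather than only on average, one possible sketch is to compute the per-borough counts up front and then shuffle which null row receives which value. The function name `fill_proportionally` and its `seed` parameter are hypothetical (not part of pandas or NumPy), and giving the rows lost to rounding to the largest fractional remainders is an assumption about how leftovers should be handled. It also assumes the column has at least one non-null value.

```python
import numpy as np
import pandas as pd


def fill_proportionally(s: pd.Series, seed: int = 0) -> pd.Series:
    """Fill NaNs so the added counts follow the observed ratios (hypothetical helper)."""
    is_null = s.isna()
    n_null = int(is_null.sum())
    freq = s.value_counts(normalize=True)

    # Ideal (fractional) number of null rows to assign to each category.
    ideal = freq.to_numpy() * n_null
    counts = np.floor(ideal).astype(int)

    # Rounding down leaves a few rows unassigned; give them to the
    # categories with the largest fractional remainders.
    shortfall = n_null - int(counts.sum())
    order = np.argsort(ideal - counts)[::-1]
    counts[order[:shortfall]] += 1

    # Build the exact set of fill values, then shuffle their order so the
    # assignment to individual rows is random.
    fill = np.repeat(freq.index.to_numpy(), counts)
    np.random.default_rng(seed).shuffle(fill)

    out = s.copy()
    out.loc[is_null] = fill
    return out
```

It could then be applied as df['BOROUGH'] = fill_proportionally(df['BOROUGH']). The per-borough totals are deterministic; only the choice of which null row gets which borough depends on the seed.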

Upvotes: 2
