Reputation: 826
Suppose we have a DataFrame with two columns: the boroughs of NYC and the incidents that occurred in each borough.
df['BOROUGH'].value_counts()
BROOKLYN 368129
QUEENS 315681
MANHATTAN 278583
BRONX 167083
STATEN ISLAND 50194
518,953 rows have a null under BOROUGH.
df.shape
(1698623,2)
How can I allocate the null values across the boroughs in proportion to each borough's share of the non-null values?
For example:
df['BOROUGH'].value_counts()/df['BOROUGH'].value_counts().sum()
BROOKLYN 0.312061
QUEENS 0.267601
MANHATTAN 0.236153
BRONX 0.141635
STATEN ISLAND 0.042549
31% of the nulls (518,953) should become BROOKLYN = 160,875
27% of the nulls (518,953) should become QUEENS = 140,117
and so forth.
After allocating the nulls proportionally, the requested output is:
df['BOROUGH'].value_counts()
BROOKLYN 529004
QUEENS 455798
.......
Upvotes: 0
Views: 173
Reputation: 150745
You can use np.random.choice:
import numpy as np

# mask of rows where BOROUGH is null
is_null = df['BOROUGH'].isna()
# distribution of the non-null values (fractions summing to 1)
freq = df['BOROUGH'].value_counts(normalize=True)
# draw one borough per null row, weighted by the observed frequencies
to_replace = np.random.choice(freq.index, p=freq.values, size=is_null.sum())
df.loc[is_null, 'BOROUGH'] = to_replace
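Note that random sampling matches the proportions only on average, so the resulting counts will vary a little from run to run. If you need the nulls split into exact proportional quotas, here is a minimal sketch of one way to do it. The helper name `fill_nulls_proportionally` and its `seed` parameter are hypothetical (not part of pandas), and handing the rounding leftovers to the largest fractional remainders is my own assumption about how to break ties. With unrounded frequencies the quotas will also differ slightly from the hand calculation in the question, which used percentages rounded to whole numbers.

```python
import numpy as np
import pandas as pd


def fill_nulls_proportionally(s: pd.Series, seed: int = 0) -> pd.Series:
    """Fill nulls so each category gets a share of them proportional
    to its frequency among the non-null values (hypothetical helper)."""
    is_null = s.isna()
    n_null = int(is_null.sum())
    freq = s.value_counts(normalize=True)
    if n_null == 0 or freq.empty:
        return s.copy()

    # ideal (fractional) number of nulls per category
    exact = freq.to_numpy() * n_null
    quota = np.floor(exact).astype(int)

    # rows lost to rounding go to the largest fractional remainders
    leftover = n_null - int(quota.sum())
    order = np.argsort(-(exact - quota), kind="stable")
    quota[order[:leftover]] += 1

    # build the fill values and shuffle so they are not assigned in blocks
    fill = np.repeat(freq.index.to_numpy(), quota)
    rng = np.random.default_rng(seed)
    rng.shuffle(fill)

    out = s.copy()
    out.loc[is_null] = fill
    return out


# small illustrative example (made-up data, not the NYC dataset)
demo = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1 + [None] * 10)
filled = fill_nulls_proportionally(demo)
print(filled.value_counts())
```

On the real data this would be called as df['BOROUGH'] = fill_nulls_proportionally(df['BOROUGH']). Which rows receive which borough is still random; only the totals per borough are fixed.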
Upvotes: 2