Reputation: 3253
I am trying to fill in NaN/null values for every column and every cell within a column in dataframe by random sampling from that column (e.g sample non-NaN value). I am doing right now the following
for col in df:
count = 0
while True:
sample = df[col].sample(n=1)
count += 1
if pd.notna(sample.item()):
df[col].replace(sample, np.nan, inplace=True)
break
if count >= 100:
break
Which is incorrect because:
it has this hack to try to sample 100 times with a hope that you finally find non-NaN within 100 tries.
It would fill cells with the sample sample, while I would like to sample randomly a value for every cell separately e.g not to have any skew
Well it does not work for some reason in any case, resulting df has NaNs as before.
Note: dataframe contains both numbers and strings
Upvotes: 2
Views: 270
Reputation: 879411
You could use np.random.choice
to generate a sample from a population of values:
sample = np.random.choice(pop, size=len(df)-len(pop), replace=True)
For example,
import numpy as np
import pandas as pd
arr = np.random.randint(10, size=(10,3)).astype(float)
mask = np.random.randint(2, size=arr.shape, dtype=bool)
arr[mask] = np.nan
df = pd.DataFrame(arr)
print(df)
# 0 1 2
# 0 8.0 NaN 0.0
# 1 1.0 3.0 2.0
# 2 NaN NaN NaN
# 3 6.0 NaN 7.0
# 4 NaN 8.0 5.0
# 5 1.0 4.0 6.0
# 6 NaN NaN NaN
# 7 NaN NaN NaN
# 8 8.0 NaN NaN
# 9 5.0 NaN 2.0
for col in df:
mask = pd.isnull(df[col])
pop = df[col].dropna()
if len(pop):
sample = np.random.choice(pop, size=len(df)-len(pop), replace=True)
df.loc[mask, col] = sample
print(df)
yields a result such as
0 1 2
0 8.0 4.0 0.0
1 1.0 3.0 2.0
2 1.0 8.0 2.0
3 6.0 3.0 7.0
4 8.0 8.0 5.0
5 1.0 4.0 6.0
6 1.0 8.0 2.0
7 8.0 4.0 6.0
8 8.0 4.0 7.0
9 5.0 3.0 2.0
df[col]
returns a Series. Modifying this Series is not guaranteed to
modify df
itself. Thus
df[col].replace(sample, np.nan, inplace=True)
modifies the Series returned by df[col]
but fails to modify df
.
Generally, to ensure that you modify a DataFrame, use df.loc[...] = ...
or df.iloc[...] = ...
or generate a new DataFrame and reassign it to df
(e.g. df = new_df
), or generate a new column of values and reassign it to a column (e.g. df[col] = values
).
Upvotes: 4