YohanRoth
YohanRoth

Reputation: 3253

For every column and cell in dataframe fill in NaNs/Nulls with random value from that column

I am trying to fill in NaN/null values for every column and every cell within a column in dataframe by random sampling from that column (e.g sample non-NaN value). I am doing right now the following

   for col in df:
        count = 0
        while True:
            sample = df[col].sample(n=1)
            count += 1
            if pd.notna(sample.item()):
                df[col].replace(sample, np.nan, inplace=True)
                break
            if count >= 100:
                break

Which is incorrect because:

  1. it has this hack to try to sample 100 times with a hope that you finally find non-NaN within 100 tries.

  2. It would fill cells with the sample sample, while I would like to sample randomly a value for every cell separately e.g not to have any skew

  3. Well it does not work for some reason in any case, resulting df has NaNs as before.

Note: dataframe contains both numbers and strings

Upvotes: 2

Views: 270

Answers (1)

unutbu
unutbu

Reputation: 879411

You could use np.random.choice to generate a sample from a population of values:

sample = np.random.choice(pop, size=len(df)-len(pop), replace=True)

For example,

import numpy as np
import pandas as pd

arr = np.random.randint(10, size=(10,3)).astype(float)
mask = np.random.randint(2, size=arr.shape, dtype=bool)
arr[mask] = np.nan
df = pd.DataFrame(arr)
print(df)
#      0    1    2
# 0  8.0  NaN  0.0
# 1  1.0  3.0  2.0
# 2  NaN  NaN  NaN
# 3  6.0  NaN  7.0
# 4  NaN  8.0  5.0
# 5  1.0  4.0  6.0
# 6  NaN  NaN  NaN
# 7  NaN  NaN  NaN
# 8  8.0  NaN  NaN
# 9  5.0  NaN  2.0

for col in df:
    mask = pd.isnull(df[col])
    pop = df[col].dropna()
    if len(pop):
        sample = np.random.choice(pop, size=len(df)-len(pop), replace=True)
        df.loc[mask, col] = sample


print(df)

yields a result such as

     0    1    2
0  8.0  4.0  0.0
1  1.0  3.0  2.0
2  1.0  8.0  2.0
3  6.0  3.0  7.0
4  8.0  8.0  5.0
5  1.0  4.0  6.0
6  1.0  8.0  2.0
7  8.0  4.0  6.0
8  8.0  4.0  7.0
9  5.0  3.0  2.0

df[col] returns a Series. Modifying this Series is not guaranteed to modify df itself. Thus

df[col].replace(sample, np.nan, inplace=True)

modifies the Series returned by df[col] but fails to modify df.

Generally, to ensure that you modify a DataFrame, use df.loc[...] = ... or df.iloc[...] = ... or generate a new DataFrame and reassign it to df (e.g. df = new_df), or generate a new column of values and reassign it to a column (e.g. df[col] = values).

Upvotes: 4

Related Questions