Reputation: 325
I have a pandas DataFrame with several missing values. I noticed that the non-missing values are close to each other, so I would like to impute each missing value by randomly choosing one of the non-missing values.
For instance:
import pandas as pd
import random
import numpy as np
foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan], 'B':[np.nan, 4, 2, np.nan, 5]})
foo
     A    B
0    2  NaN
1    3    4
2  NaN    2
3    5  NaN
4  NaN    5
For instance, I would like foo['A'][2] = 2 and foo['A'][4] = 3.
The shape of my pandas DataFrame is (6940,154).
I tried this:
foo['A'] = foo['A'].fillna(random.choice(foo['A'].values.tolist()))
But it is not working. Could you help me achieve that? Best regards.
Upvotes: 13
Views: 20694
Reputation: 1
import random
import pandas as pd
# Replace each NaN with a random pick from the column's distinct non-NaN values
df["column"] = df["column"].apply(
    lambda x: random.choice(df["column"].dropna().unique()) if pd.isna(x) else x)
Upvotes: 0
Reputation: 1
Replacing NaN with a random number from the range:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan, 6, 7, np.nan, 9]})
min_value = 0
max_value = 10
df['A'] = df['A'].apply(lambda x: np.random.randint(min_value, max_value) if pd.isnull(x) else x)
print(df)
Upvotes: 0
Reputation: 552
Not the most concise, but probably the most performant way to go:
nans = df[col].isna()
non_nans = df.loc[df[col].notna(), col]
samples = np.random.choice(non_nans, size=nans.sum())
df.loc[nans, col] = samples
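For reference, here is a self-contained sketch of the same mask-and-sample idea applied to every column of the `foo` frame from the question. The helper name `impute_random` and the `seed` parameter are made up for illustration, and using a seeded `np.random.default_rng` is an assumption added for reproducibility, not part of the snippet above.

```python
import numpy as np
import pandas as pd

def impute_random(df, seed=None):
    """Return a copy of df where each NaN is replaced by a random
    non-NaN value drawn (with replacement) from the same column."""
    rng = np.random.default_rng(seed)  # hypothetical: seeded for reproducibility
    out = df.copy()
    for col in out.columns:
        nans = out[col].isna()
        pool = out.loc[~nans, col].to_numpy()
        # Skip columns with nothing to fill or nothing to sample from
        if nans.any() and len(pool) > 0:
            out.loc[nans, col] = rng.choice(pool, size=int(nans.sum()))
    return out

foo = pd.DataFrame({'A': [2, 3, np.nan, 5, np.nan],
                    'B': [np.nan, 4, 2, np.nan, 5]})
filled = impute_random(foo, seed=0)
print(filled)
```

Each missing cell gets its own independent draw, and the original `foo` is left untouched because the helper works on a copy.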
Upvotes: 0
Reputation: 2144
Only this worked for me; all the examples above failed. Some filled every NaN with the same number, and some didn't fill anything.
def fill_sample(df, col):
    # Draw one non-NaN value per missing entry (sample draws without replacement)
    tmp = df[df[col].notna()][col].sample(len(df[df[col].isna()])).values
    k = 0
    for i, row in df[df[col].isna()].iterrows():
        df.at[i, col] = tmp[k]
        k += 1
    return df
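One caveat with `sample`: by default it draws without replacement, so it raises a `ValueError` when a column has more missing values than observed ones. A minimal sketch of drawing with `replace=True` instead, on a made-up Series `s` (the data and variable names are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 4.0])
mask = s.isna()
observed = s[~mask]

# replace=True allows drawing more values than there are observed entries
draws = observed.sample(n=int(mask.sum()), replace=True, random_state=0)

filled = s.copy()
# Assign by position; the sampled index labels are deliberately ignored
filled[mask] = draws.to_numpy()
print(filled)
```

Passing `random_state` is optional; it only makes the draw repeatable.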
Upvotes: 0
Reputation: 325
What I ended up doing, and what worked, was:
foo = foo.apply(lambda x: x.fillna(random.choice(x.dropna().tolist())), axis=1)
Upvotes: 0
Reputation: 61
I did this to fill the NaN values with a random non-NaN value:
import random
# .tolist() avoids label-based lookups inside random.choice
df['column'] = df['column'].fillna(random.choice(df['column'].dropna().tolist()))
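Because `fillna` is given a single scalar here, every NaN in the column ends up with the same randomly chosen value. If a separate draw per missing entry is wanted, one option is to pass `fillna` a Series aligned on the missing rows. A small sketch contrasting the two, on a made-up Series `s` (the names `same`, `varied`, and `pool` are illustrative):

```python
import random
import numpy as np
import pandas as pd

random.seed(0)
s = pd.Series([2, 3, np.nan, 5, np.nan, np.nan])
pool = s.dropna().tolist()
mask = s.isna()

# One draw passed as a scalar: all NaNs share that value
same = s.fillna(random.choice(pool))

# One draw per NaN, indexed by the positions that are missing
fill = pd.Series(random.choices(pool, k=int(mask.sum())), index=s.index[mask])
varied = s.fillna(fill)

print(same)
print(varied)
```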
Upvotes: 6
Reputation: 856
You can use the pandas fillna method together with random.choice to fill the missing values with values randomly selected from the same column.
import random
import pandas as pd
non_nan = df["column"].dropna().tolist()
mask = df["column"].isna()
fill = pd.Series([random.choice(non_nan) for _ in range(mask.sum())], index=df.index[mask])  # one draw per NaN
df["column"] = df["column"].fillna(fill)
Here "column" is the name of the column whose NaN values you want to fill with randomly chosen non-NaN values.
Upvotes: 11
Reputation: 2328
Here is another pandas DataFrame approach:
import numpy as np
def fill_with_random(df2, column):
    '''Fill `df2`'s column with name `column` with random data based on non-NaN data from `column`'''
    df = df2.copy()
    df[column] = df[column].apply(lambda x: np.random.choice(df[column].dropna().values) if np.isnan(x) else x)
    return df
Upvotes: 1
Reputation: 1577
This works well for me on a pandas DataFrame:
import random
def randomiseMissingData(df2):
    "randomise missing data for DataFrame (within a column)"
    df = df2.copy()
    for col in df.columns:
        mask = df[col].isnull()
        samples = random.choices(df.loc[~mask, col].values, k=mask.sum())
        df.loc[mask, col] = samples  # write through df so the copy is actually updated
    return df
Upvotes: 7
Reputation: 6376
This is another approach to this question. It improves on the first answer and uses np.isnan to check whether a NumPy value is NaN, as described in the NumPy documentation.
lo, hi = int(foo['A'].min()), int(foo['A'].max())  # range() needs integer bounds
foo['A'] = foo['A'].apply(lambda x: np.random.choice(range(lo, hi)) if np.isnan(x) else x)
Upvotes: 3