Reputation: 2605
I have a data frame (data_train) with NaN values. A sample is given below:
republican n y
republican n NaN
democrat NaN n
democrat n y
I want to replace all the NaN values with random values, like this:
republican n y
republican n rnd2
democrat rnd1 n
democrat n y
How do I do it?
I tried the following, but had no luck:
df_rand = pd.DataFrame(np.random.randn(data_train.shape[0],data_train.shape[1]))
data_train[pd.isnull(data_train)] = df_rand[pd.isnull(data_train)]
When I run the above on a dataframe with random numerical data, the script works fine.
Upvotes: 13
Views: 27256
Reputation: 1
Replacing NaNs according to discrete column distribution
import pandas as pd
import numpy as np

def discrete_column_resampling(df, column_names):
    # Modifies df in place: each NaN is replaced by a value drawn
    # according to the observed frequencies of that column.
    for column in column_names:
        value_counts = df[column].value_counts()
        counts = np.array(value_counts.values.tolist())
        probabilities = counts / np.sum(counts)
        values = value_counts.index.tolist()
        df[column] = df[column].apply(lambda l: l if not pd.isna(l) else \
            np.random.choice(values, p=probabilities))
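A possible vectorized variant of the same idea, as a rough sketch: sampling uniformly from a column's observed entries reproduces its empirical distribution, so all replacements for a column can be drawn in one call. The function name fill_from_observed, the seed parameter, and the example frame are made up for illustration; it returns a copy rather than modifying df in place.

```python
import numpy as np
import pandas as pd

def fill_from_observed(df, column_names, seed=None):
    """Fill NaNs in each listed column by sampling from that column's observed values."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for column in column_names:
        mask = out[column].isna()
        observed = out.loc[~mask, column].to_numpy()
        if mask.any() and len(observed) > 0:
            # Uniform sampling over the observed entries follows the
            # empirical frequencies of the column.
            out.loc[mask, column] = rng.choice(observed, size=int(mask.sum()))
    return out

# Hypothetical data shaped like the sample in the question.
example = pd.DataFrame({
    "party": ["republican", "republican", "democrat", "democrat"],
    "vote1": ["n", "n", np.nan, "n"],
    "vote2": ["y", np.nan, "n", "y"],
})
filled = fill_from_observed(example, ["vote1", "vote2"], seed=0)
```

Because it samples existing values, this works for string columns as well as numeric ones.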
Upvotes: 0
Reputation: 43
If by random you actually mean (or need) unique values, this fast solution works and leaves room for all kinds of further quick modifications:
mask = df[col].isnull()
# Use .loc to avoid chained assignment; each NaN gets its own index label.
df.loc[mask, col] = df[col][mask].index  # .astype(str).str.etc...
Upvotes: 1
Reputation: 6367
Try my code. I combined the prior answers into a working example:
M = len(data_train.index)
N = len(data_train.columns)
df_rand = pd.DataFrame(np.random.randn(M,N), columns=data_train.columns, index=data_train.index)
data_train[pd.isnull(data_train)] = df_rand[pd.isnull(data_train)]
It is faster than using applymap.
Upvotes: 0
Reputation: 55
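You can fill the NaNs with randomly chosen existing values by using the ~ (tilde) operator to select the non-null entries:
mask = df['column'].isna()
# Draw one value per NaN; a single np.random.choice() call without size would give every NaN the same value.
df.loc[mask, 'column'] = np.random.choice(df.loc[~mask, 'column'], size=mask.sum())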
Upvotes: 1
Reputation: 13
Calling fillna() inside a loop with the limit parameter set to 1 replaces each NaN with a different random value.
import random
# Series stands for your own pandas Series.
while Series.isnull().sum() != 0:
    Series.fillna(random.uniform(0, 100), inplace=True, limit=1)
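If the loop gets slow on a long Series, one alternative is to draw all the replacements at once and assign them through a boolean mask. This is only a sketch: the Series s, the seed, and the 0 to 100 range are example choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan])  # example data
rng = np.random.default_rng(42)

mask = s.isna()
s_filled = s.copy()
# One independent uniform draw per missing entry.
s_filled[mask] = rng.uniform(0, 100, size=int(mask.sum()))
```

The existing values stay untouched, and each NaN receives its own draw.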
Upvotes: 0
Reputation: 1222
If you want to replace all NaNs from the DF with random values from a list, you can do something like this:
import numpy as np
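# applymap returns a new DataFrame, so assign the result; np.isnan only accepts numeric values.
df = df.applymap(lambda l: l if not np.isnan(l) else np.random.choice([1, 3]))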
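For a frame with string columns like the one in the question, np.isnan raises a TypeError, so a variant based on pandas' own null check may be easier. The sketch below builds a same-shaped frame of random picks and keeps the original cell wherever it is not null. The frame contents and the candidate list ["y", "n"] are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "party": ["republican", "republican", "democrat", "democrat"],
    "vote1": ["n", "n", np.nan, "n"],
    "vote2": ["y", np.nan, "n", "y"],
})
rng = np.random.default_rng(1)

# A frame of random picks with the same shape, index and columns as df.
fallback = pd.DataFrame(rng.choice(["y", "n"], size=df.shape),
                        index=df.index, columns=df.columns)

# Keep the original cell where it is not null, otherwise take the random pick.
filled = df.where(df.notna(), fallback)
```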
Upvotes: 7
Reputation: 131
If you want to replace the NaNs in your columns using the hot deck technique, I can suggest something like this:
import random
import numpy as np

def hot_deck(dataframe):
    # NaNs are marked with 0, so genuine zeros are treated as missing too.
    dataframe = dataframe.fillna(0)
    for col in dataframe.columns:
        assert (dataframe[col].dtype == np.float64) | (dataframe[col].dtype == np.int64)
        # Donor pool: the distinct observed (non-zero) values of the column.
        liste_sample = dataframe[dataframe[col] != 0][col].unique()
        dataframe[col] = dataframe.apply(lambda row: random.choice(liste_sample) if row[col] == 0 else row[col], axis=1)
    return dataframe
If you would rather replace each NaN with a new random value, you can do something like the following. You just have to choose the maximum value for your random choices.
def hot_deck(dataframe, max_value):
    # Count the NaNs per column before they are replaced by the 0 marker.
    n_missing = dataframe.isnull().sum()
    dataframe = dataframe.fillna(0)
    for col in dataframe.columns:
        assert (dataframe[col].dtype == np.float64) | (dataframe[col].dtype == np.int64)
        if n_missing[col] == 0:
            continue
        # max_value must be at least the number of missing entries in the column.
        liste_sample = random.sample(range(max_value), int(n_missing[col]))
        dataframe[col] = dataframe.apply(lambda row: random.choice(liste_sample) if row[col] == 0 else row[col], axis=1)
    return dataframe
Upvotes: 2
Reputation: 5879
You can use the pandas update method, like this:
1) Generate a random DataFrame with the same columns and index as the original one:
import numpy as np; import pandas as pd
M = len(df.index)
N = len(df.columns)
ran = pd.DataFrame(np.random.randn(M,N), columns=df.columns, index=df.index)
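2) Then use update with overwrite=False, so that only the NaN values in df are replaced by the generated random values (with the default overwrite=True, every value in df would be replaced):
df.update(ran, overwrite=False)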
In the above example I used values from a standard normal, but you can also use values randomly picked from the original DataFrame:
import numpy as np; import pandas as pd
M = len(df.index)
N = len(df.columns)
val = np.ravel(df.values)
val = val[~np.isnan(val)]
val = np.random.choice(val, size=(M,N))
ran = pd.DataFrame(val, columns=df.columns, index=df.index)
df.update(ran, overwrite=False)  # only fill the NaNs, keep existing values
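To see what the overwrite argument of update does, here is a small sketch on a made-up numeric frame (the data and the seed are only for illustration). With overwrite=False only the NaN cells are filled; with the default overwrite=True every cell is replaced by the value from ran.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})
rng = np.random.default_rng(0)
ran = pd.DataFrame(rng.standard_normal(df.shape),
                   columns=df.columns, index=df.index)

only_nans = df.copy()
only_nans.update(ran, overwrite=False)  # existing values are kept

everything = df.copy()
everything.update(ran)                  # default overwrite=True replaces every cell
```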
Upvotes: 6
Reputation: 16134
Well, if you use fillna to fill the NaNs, the random generator runs only once, so every NaN is filled with the same number.
So, make sure that a new random number is generated and used each time. For a dataframe like this:
Date A B
0 2015-01-01 NaN NaN
1 2015-01-02 NaN NaN
2 2015-01-03 NaN NaN
3 2015-01-04 NaN NaN
4 2015-01-05 NaN NaN
5 2015-01-06 NaN NaN
6 2015-01-07 NaN NaN
7 2015-01-08 NaN NaN
8 2015-01-09 NaN NaN
9 2015-01-10 NaN NaN
10 2015-01-11 NaN NaN
11 2015-01-12 NaN NaN
12 2015-01-13 NaN NaN
13 2015-01-14 NaN NaN
14 2015-01-15 NaN NaN
15 2015-01-16 NaN NaN
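I used the following code to fill the NaNs in column A:
import random
import pandas as pd
# Draw a new random number for every NaN; existing values are left untouched.
x['A'] = x['A'].apply(lambda v: random.random() * 1000 if pd.isna(v) else v)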
Which will give us something like:
Date A B
0 2015-01-01 96.538211 NaN
1 2015-01-02 404.683392 NaN
2 2015-01-03 849.614253 NaN
3 2015-01-04 590.030660 NaN
4 2015-01-05 203.167519 NaN
5 2015-01-06 980.508258 NaN
6 2015-01-07 221.088002 NaN
7 2015-01-08 285.013762 NaN
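The difference between the two approaches can be checked on a tiny frame. This is a sketch with invented data and a fixed seed: fillna receives a single number that was computed once, while apply calls the generator again for each NaN.

```python
import random
import numpy as np
import pandas as pd

x = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=4),
    "A": [np.nan, 2.0, np.nan, np.nan],
})
random.seed(0)

# random.random() is evaluated once, before fillna is called.
same = x["A"].fillna(random.random() * 1000)

# The lambda runs per element, so each NaN gets a fresh draw.
per_row = x["A"].apply(lambda v: random.random() * 1000 if pd.isna(v) else v)
```

Here same holds one repeated value in the three formerly missing rows, whereas per_row holds three different ones.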
Upvotes: 6
Reputation: 10398
Just use fillna this way:
import random
# Note: random.random() is evaluated once, so every NaN receives the same value.
data_train = data_train.fillna(random.random())
Upvotes: -1