Reputation: 95
I have a dataframe with several currently-empty columns. I want a fraction of these filled with data drawn from a normal distribution, while all the rest are left blank. So, for example, if 60% of the elements should be blank, then 60% would be, while the other 40% would be filled. I already have the normal distribution, via numpy, but I'm trying to figure out how to choose random rows to fill. Currently, the only way I can think of involves FOR loops, and I would rather avoid that.
Does anyone have any ideas for how I could fill empty elements of a dataframe at random? I have a bit of the code below, for the random numbers.
data.loc[data['ColumnA'] == 'B', 'ColumnC'] = np.random.normal(1000, 500, rowsB).astype('int64')
Upvotes: 1
Views: 1321
Reputation: 3009
piRSquared's advice is good. We are left guessing what to solve. Having just looked through some of the latest unanswered pandas questions there are worse.
import pandas as pd
import numpy as np
#some redundancy here as i make an empty dataframe -pretending i start like you with a Dataframe.
df = pd.DataFrame(index = range(11),columns=list('abcdefg'))
num_cells = np.product(df.shape)
# make a 2-dim array with number from 1 to number cells.
arr =np.arange(1,num_cells+1)
#inplace shuffle - this is the key randomization operation
np.random.shuffle(arr)
arr = arr.reshape(df.shape)
#place the shuffled values, normalized to the number of cells, into my dateframe.
df = pd.DataFrame(index = df.index,columns = df.columns,data=arr/np.float(num_cells))
#use applymap to set keep 40% of cells as ones, the other 60% as nan.
df = df.applymap(lambda x: 1 if x > 0.6 else np.nan)
# now sample a full set from normal distribution
# but when multiplying the nans will cause the sampled value to nullify, whilst the multiply by 1 will retain the sample value.
df * np.random.normal(1000,500,df.shape)
Thus you are left with a random 40% of the cells containing a draw from your normal distribution.
If your dataframe was large you could assume the stability of the uniform rand() function. Here i didn't do that and rather determined explicitly how many cells are above and below the threshold.
Upvotes: 1