Steven Nguyen

Reputation: 1

Filling missing data for two columns in a pandas DataFrame using non-missing data

I have a pandas dataframe with 3 columns.

data = data[['id','foo','bar']]

For about 1% of the dataset, both foo and bar are missing, but id is not. I'm looking to impute them with random pairs of non-null foo and bar values. Assume id is never null and that foo and bar are either both null or both non-null.

Upvotes: 0

Views: 999

Answers (3)

Michele87

Reputation: 31

Are you looking to do something like this?

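import pandas as pd
import numpy as np

index = range(10)
df = pd.DataFrame(np.random.randn(10, 2), index=index, columns=['foo', 'bar'])
# Blank out foo in the first four rows (a single .loc call avoids chained assignment)
df.loc[df.index[:4], 'foo'] = np.nan

invalid = df['foo'].isnull()        # boolean mask of rows to impute
nInvalid = invalid.sum()
valids = df.loc[~invalid, 'foo']    # ~ negates a boolean mask; unary minus does not
nValid = valids.shape[0]
# Positions of randomly chosen non-null rows, drawn with replacement
randomInst = np.random.randint(0, nValid, nInvalid)
# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
df.loc[invalid, 'foo'] = valids.iloc[randomInst].to_numpy()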

Edit to apply to bar as well:

df.loc[invalid, 'bar'] = df.loc[~invalid, 'bar'].iloc[randomInst].to_numpy()

Upvotes: 0

user707650

Reputation:

Assuming that when the 'foo' value is missing, the 'bar' value is also missing (as per your question), and that the column types are floating point:

mask = df['foo'].isnull()
df.loc[mask,['foo', 'bar']] = np.random.random((np.sum(mask), 2))


If you want to use valid values from the actual dataframe itself (as they better represent the value range of your data), you could use the following instead:

df.loc[mask,['foo', 'bar']] = df[['foo', 'bar']][~mask].sample(np.sum(mask)).values

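(Optionally pass replace=True to the sample method to draw with replacement; without it, sample raises an error if there are more missing rows than valid ones. The values from np.random.random are drawn independently in any case.)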
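For reference, here is a minimal end-to-end sketch of the sampling approach above. The toy data, the seed, and the names valid_pairs and donors are made up for illustration and are not from the question. Whole rows are sampled, so each imputed foo keeps the bar it was originally paired with.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): 10 rows, with foo and bar both missing in rows 2 and 7
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": range(10),
    "foo": rng.normal(size=10),
    "bar": rng.normal(size=10),
})
df.loc[[2, 7], ["foo", "bar"]] = np.nan

# Rows that need imputing; foo and bar are assumed to be null together
mask = df["foo"].isnull()

# Candidate (foo, bar) pairs come from the complete rows only
valid_pairs = df.loc[~mask, ["foo", "bar"]]

# Draw one complete row per missing row, with replacement
donors = valid_pairs.sample(n=int(mask.sum()), replace=True, random_state=0)

# Assign by position; .to_numpy() drops the donors' index so no alignment happens
df.loc[mask, ["foo", "bar"]] = donors.to_numpy()
```

The .to_numpy() call matters: assigning the sampled DataFrame directly would make pandas align on the index, and the donor rows have different index labels from the rows being filled.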

Upvotes: 0

王晓晨

Reputation: 336

Can this help you?

import pandas as pd
data = pd.DataFrame(data)
invalid_data = data[(data['foo'].isnull()) & (data['bar'].isnull())]
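The selection above only finds the rows where both columns are null; it does not fill them. As a rough sketch, the same masks can also be used to check the question's assumption that foo and bar are always missing together. The toy values and the names both_null, either_null and assumption_holds are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): foo and bar are missing together for ids 2 and 4
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "foo": [0.5, np.nan, 1.5, np.nan],
    "bar": [10.0, np.nan, 30.0, np.nan],
})

both_null = data["foo"].isnull() & data["bar"].isnull()
either_null = data["foo"].isnull() | data["bar"].isnull()

# If foo and bar are always null together, the two masks are identical
assumption_holds = bool((both_null == either_null).all())

# Rows that need imputing
invalid_data = data[both_null]
```

If assumption_holds turns out to be False on real data, some rows have only one of the two values missing, and pair-wise imputation would overwrite the value that is present.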

Upvotes: 0
