Reputation: 3135
I have a very simple dataframe in Pandas,
testdf = [{'name' : 'id1', 'W': np.NaN, 'L': 0, 'D':0},
{'name' : 'id2', 'W': 0, 'L': np.NaN, 'D':0},
{'name' : 'id3', 'W': np.NaN, 'L': 10, 'D':0},
{'name' : 'id4', 'W': 75, 'L': 20, 'D':0}
]
testdf = pd.DataFrame(testdf)
testdf = testdf[['name', 'W', 'L', 'D']]
which looks like this:
| name | W | L | D |
|------|-----|-----|---|
| id1 | NaN | 0 | 0 |
| id2 | 0 | NaN | 0 |
| id3 | NaN | 10 | 0 |
| id4 | 75 | 20 | 0 |
My goal is simple:
1) I want to impute all the missing values by simply replacing them with a 0.
2) Next I want to create indicator columns with a 0 or 1 to indicate that the new value (the 0) is indeed created by the imputation process.
It's probably easier to just show instead of explain with words:
| name | W | W_indicator | L | L_indicator | D | D_indicator |
|------|----|-------------|----|-------------|---|-------------|
| id1 | 0 | 1 | 0 | 0 | 0 | 0 |
| id2 | 0 | 0 | 0 | 1 | 0 | 0 |
| id3 | 0 | 1 | 10 | 0 | 0 | 0 |
| id4 | 75 | 0 | 20 | 0 | 0 | 0 |
My attempts have failed, since I get stuck trying to change all non-NaN values to some placeholder value, then change all NaNs to a 0, then change back the placeholder value to NaN, etc etc. It gets messy so fast. Then I keep getting all kinds of slice warnings. And the masks get all jumbled. I'm sure there's a much more elegant way to do this than my wonky heuristical methods.
Upvotes: 4
Views: 4950
Reputation: 23211
A few years late to the party, but this is how I do it:
transformer = FeatureUnion(
transformer_list=[
('features', SimpleImputer(strategy='mean')),
('indicators', MissingIndicator())])
transformer = transformer.fit(Xnum, df.fraud)
results = transformer.transform(Xnum)
results.shape
Upvotes: 1
Reputation: 863301
You can use isnull
with convert to int
by astype
and add_prefix
for new df
and then concat
with reindex_axis
by cols
created by some solution from this answers:
cols = ['W','L','D']
df = testdf[cols].isnull().astype(int).add_suffix('_indicator')
print (df)
W_indicator L_indicator D_indicator
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 0
Solution with generator:
def mygen(lst):
for item in lst:
yield item
yield item + '_indicator'
df1 = pd.concat([testdf.fillna(0), df], axis=1) \
.reindex_axis(['name'] + list(mygen(cols)), axis=1)
print (df1)
name W W_indicator L L_indicator D D_indicator
0 id1 0.0 1 0.0 0 0 0
1 id2 0.0 0 0.0 1 0 0
2 id3 0.0 1 10.0 0 0 0
3 id4 75.0 0 20.0 0 0 0
And solution with list comprehenion:
cols = ['name'] + [item for x in cols for item in (x, x + '_indicator')]
df1 = pd.concat([testdf.fillna(0), df], axis=1).reindex_axis(cols, axis=1)
print (df1)
name W W_indicator L L_indicator D D_indicator
0 id1 0.0 1 0.0 0 0 0
1 id2 0.0 0 0.0 1 0 0
2 id3 0.0 1 10.0 0 0 0
3 id4 75.0 0 20.0 0 0 0
Upvotes: 5