Reputation: 45
I have a large dataset in pandas. For brevity, let's say I have the following
df = pd.DataFrame({'col1': [101,101,101,201,201,201,np.nan],
'col2':[123,123,124,np.nan,321,321,456],
'col3':['a',0.7,0.6,1.01,2,1,2],
'col4':['w',0.2,'b',0.7,'z',2,3],
'col5':[21,'z',0.3,2.3,0.8,'z',1.001],
'col6':[11.3,202.0,0.2,0.3,41.0,47,2],
'col7':['A','B','C','D','E','F','G']})
Now I want to create categorical variables with the suffix _missing
such that for any column in the dataset that contains missing nan
a new column (variable) should be created that has values 1
for 'nan' values and 0
otherwise. For example, for col1
and col2
, their corresponding variables will be col1_missing
and col2_missing
.
Then for columns like col3
that have alphabets in a column that is supposed to be numeric, I will like similar result as described above, but with the levels of categories increasing with the number of different alphabets. For example the new column corresponding to col4
will be col4_missing
and will contain 0 for non-alphabets, 1
for b
, 2
for w
and 3
for z
. So the resulting frame should look as below:
Is there any python function or package to do this? As a newbie, I am honestly overwhelmed with this and I would be grateful for any help on this.
Upvotes: 1
Views: 630
Reputation: 260530
You can map
the values from a dictionary:
def flag(s):
flags = {'b': 1, 'w': 2, 'z': 3}
return s.fillna('b').map(lambda x: flags.get(x, 0))
out = (pd
.concat([df, df.apply(flag).add_suffix('_missing')], axis=1)
.sort_index(axis=1)
)
Output:
col1 col1_missing col2 col2_missing col3 col3_missing col4 col4_missing col5 col5_missing col6 col6_missing col7 col7_missing
0 101.0 0 123.0 0 a 0 w 2 21 0 11.3 0 A 0
1 101.0 0 123.0 0 0.7 0 0.2 0 z 3 202.0 0 B 0
2 101.0 0 124.0 0 0.6 0 b 1 0.3 0 0.2 0 C 0
3 201.0 0 NaN 1 1.01 0 0.7 0 2.3 0 0.3 0 D 0
4 201.0 0 321.0 0 2 0 z 3 0.8 0 41.0 0 E 0
5 201.0 0 321.0 0 1 0 2 0 z 3 47.0 0 F 0
6 NaN 1 456.0 0 2 0 3 0 1.001 0 2.0 0 G 0
def flag(s):
flags = {'b': 1, 'w': 2, 'z': 3}
return s.fillna('b').map(lambda x: flags.get(x, 0))
# flag values
df2 = df.apply(flag).add_suffix('_missing')
# keep only columns with at least one flag
df2 = df2.loc[:, df2.ne(0).any()]
out = (pd
.concat([df, df2], axis=1)
.sort_index(axis=1)
)
Upvotes: 1