Lomnewton
Lomnewton

Reputation: 45

Creating missing dummy indicator variable for a list of variable in a dataframe with specific values in python (pandas)

I have a large dataset in pandas. For brevity, let's say I have the following

df = pd.DataFrame({'col1': [101,101,101,201,201,201,np.nan],
                  'col2':[123,123,124,np.nan,321,321,456],
                 'col3':['a',0.7,0.6,1.01,2,1,2],
                 'col4':['w',0.2,'b',0.7,'z',2,3],
                 'col5':[21,'z',0.3,2.3,0.8,'z',1.001],
                 'col6':[11.3,202.0,0.2,0.3,41.0,47,2],
                 'col7':['A','B','C','D','E','F','G']})

Initial Data

Now I want to create categorical variables with the suffix _missing such that for any column in the dataset that contains missing nan a new column (variable) should be created that has values 1 for 'nan' values and 0 otherwise. For example, for col1 and col2, their corresponding variables will be col1_missing and col2_missing.

Then for columns like col3 that have alphabets in a column that is supposed to be numeric, I will like similar result as described above, but with the levels of categories increasing with the number of different alphabets. For example the new column corresponding to col4 will be col4_missing and will contain 0 for non-alphabets, 1 for b, 2 for w and 3 for z. So the resulting frame should look as below:

Resulting dataframe

Is there any python function or package to do this? As a newbie, I am honestly overwhelmed with this and I would be grateful for any help on this.

Upvotes: 1

Views: 630

Answers (1)

mozway
mozway

Reputation: 260530

You can map the values from a dictionary:

def flag(s):
    flags = {'b': 1, 'w': 2, 'z': 3}
    return s.fillna('b').map(lambda x: flags.get(x, 0))

out = (pd
 .concat([df, df.apply(flag).add_suffix('_missing')], axis=1)
 .sort_index(axis=1)
 )

Output:

    col1  col1_missing   col2  col2_missing  col3  col3_missing col4  col4_missing   col5  col5_missing   col6  col6_missing col7  col7_missing
0  101.0             0  123.0             0     a             0    w             2     21             0   11.3             0    A             0
1  101.0             0  123.0             0   0.7             0  0.2             0      z             3  202.0             0    B             0
2  101.0             0  124.0             0   0.6             0    b             1    0.3             0    0.2             0    C             0
3  201.0             0    NaN             1  1.01             0  0.7             0    2.3             0    0.3             0    D             0
4  201.0             0  321.0             0     2             0    z             3    0.8             0   41.0             0    E             0
5  201.0             0  321.0             0     1             0    2             0      z             3   47.0             0    F             0
6    NaN             1  456.0             0     2             0    3             0  1.001             0    2.0             0    G             0

only columns with at least one non-zero

def flag(s):
    flags = {'b': 1, 'w': 2, 'z': 3}
    return s.fillna('b').map(lambda x: flags.get(x, 0))

# flag values 
df2 = df.apply(flag).add_suffix('_missing')

# keep only columns with at least one flag
df2 = df2.loc[:, df2.ne(0).any()]

out = (pd
 .concat([df, df2], axis=1)
 .sort_index(axis=1)
 )

Upvotes: 1

Related Questions