Pylander

Reputation: 1591

Fill NaN Column with Other Column Values, Duplicates New Row

I am trying to perform a somewhat perplexing operation efficiently on a dataset with the following general format:

id,date,ind_1,ind_2,ind_3,ind_4
1,2014-01-01,ind_1,NaN,NaN,NaN
2,2014-01-02,ind_1,NaN,ind_3,NaN
3,2014-01-03,ind_1,ind_2,ind_3,NaN

I am trying to create a new column "ind_all" that is filled with any non-null "ind" column. With one value per row that is simple enough; I can use .idxmax(). The tricky part is that a row can have several non-null "ind" columns, so I need one output record per non-null value, duplicating the rest of the row. The example above should end up looking like this:

id,date,ind_1,ind_2,ind_3,ind_4,ind_all
1,2014-01-01,ind_1,NaN,NaN,NaN,ind_1
2,2014-01-02,ind_1,NaN,ind_3,NaN,ind_1
2,2014-01-02,ind_1,NaN,ind_3,NaN,ind_3
3,2014-01-03,ind_1,ind_2,ind_3,NaN,ind_1
3,2014-01-03,ind_1,ind_2,ind_3,NaN,ind_2
3,2014-01-03,ind_1,ind_2,ind_3,NaN,ind_3

Any tips or tricks are most appreciated as always!

Upvotes: 2

Views: 68

Answers (1)

cs95

Reputation: 402323

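There's a merge-based solution that uses melt (or stack) to build the right-hand side (RHS): reshape the "ind" columns into one row per id and non-null value, then merge that back onto df.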

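v = (df.drop(columns='date')     # drop('date', 1) fails in pandas >= 2.0 (axis is keyword-only)
       .melt('id')               # long format: id, variable, value
       .drop(columns='variable')
       .dropna()                 # keep only the non-null indicators
       .rename({'value' : 'ind_all'}, axis=1)
)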

df.merge(v)

   id        date  ind_1  ind_2  ind_3  ind_4 ind_all
0   1  2014-01-01  ind_1    NaN    NaN    NaN   ind_1
1   2  2014-01-02  ind_1    NaN  ind_3    NaN   ind_1
2   2  2014-01-02  ind_1    NaN  ind_3    NaN   ind_3
3   3  2014-01-03  ind_1  ind_2  ind_3    NaN   ind_1
4   3  2014-01-03  ind_1  ind_2  ind_3    NaN   ind_2
5   3  2014-01-03  ind_1  ind_2  ind_3    NaN   ind_3
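For anyone who wants to reproduce this end to end, here is a minimal sketch of the melt approach. It assumes the sample from the question is loaded with read_csv from an inline string; the names csv_text and result are only illustrative, and value_name is used in place of the separate rename step.

```python
import io

import pandas as pd

# Sample data copied from the question; read_csv parses "NaN" as missing.
csv_text = """id,date,ind_1,ind_2,ind_3,ind_4
1,2014-01-01,ind_1,NaN,NaN,NaN
2,2014-01-02,ind_1,NaN,ind_3,NaN
3,2014-01-03,ind_1,ind_2,ind_3,NaN
"""
df = pd.read_csv(io.StringIO(csv_text))

# Long format: one row per (id, non-null indicator value).
v = (
    df.drop(columns="date")
      .melt("id", value_name="ind_all")  # names the value column directly
      .drop(columns="variable")
      .dropna()
)

# The two frames share only the "id" column, so merge joins on it by default.
result = df.merge(v)
print(result)
```

Because the join key is inferred from the shared column names, this relies on "id" being the only column common to both frames; passing on="id" explicitly would make that assumption visible.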

Or,

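df.merge(df.drop(columns='date')
           .set_index('id')
           .stack()
           .dropna()                         # explicit: stack(future_stack=True) does not drop NaN
           .reset_index(level=1, drop=True)  # drop the column-name level, keep id
           .to_frame('ind_all'),
         left_on='id',
         right_index=True
)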

   id        date  ind_1  ind_2  ind_3  ind_4 ind_all
0   1  2014-01-01  ind_1    NaN    NaN    NaN   ind_1
1   2  2014-01-02  ind_1    NaN  ind_3    NaN   ind_1
1   2  2014-01-02  ind_1    NaN  ind_3    NaN   ind_3
2   3  2014-01-03  ind_1  ind_2  ind_3    NaN   ind_1
2   3  2014-01-03  ind_1  ind_2  ind_3    NaN   ind_2
2   3  2014-01-03  ind_1  ind_2  ind_3    NaN   ind_3
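Note that the stack variant keeps the left frame's index, so the labels repeat (0, 1, 1, 2, 2, 2). If a clean 0..n-1 index is wanted, a reset_index(drop=True) at the end takes care of it. The sketch below assumes the question's sample rebuilt inline; the names rhs and out are illustrative only.

```python
import pandas as pd

# Sample data from the question, built inline; None marks missing values.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2014-01-01", "2014-01-02", "2014-01-03"],
    "ind_1": ["ind_1", "ind_1", "ind_1"],
    "ind_2": [None, None, "ind_2"],
    "ind_3": [None, "ind_3", "ind_3"],
    "ind_4": [None, None, None],
})

# Right-hand side: one row per non-null indicator, indexed by id.
rhs = (
    df.drop(columns="date")
      .set_index("id")
      .stack()
      .dropna()                         # explicit, in case stack() keeps missing values
      .reset_index(level=1, drop=True)  # drop the column-name level
      .to_frame("ind_all")
)

out = (
    df.merge(rhs, left_on="id", right_index=True)
      .reset_index(drop=True)           # replace the repeated labels with 0..n-1
)
print(out)
```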

Upvotes: 4
