Math
Math

Reputation: 251

Explode multiple columns

My dataset has multiple columns which include lists.

   A             B                First_val   Second_value  Colour
0  aaa  ['hello', 'world']   ['bla','bla']   ['blue']       Red
1  sad  ['aaa']              ['cat']         ['hi']         Green
2  vas  ['min','max']        ['avg']         []             Yellow
...

The colour should be assigned only based on A value: this means that, if I have multiple rows because of exploding lists, I should have red all the times, then green, then yellow for all the rows having in A aaa, sad and vas respectively.

I have tried with

(df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', 1))

but the value of Colour is not assigned per all A's values, but only for the first one (the others are NaN). Example (wrong):

   A             B                First_val   Second_value  Colour
0  aaa  hello                     bla          blue       Red
1  aaa  world                     bla          nan        nan
...

Expected output:

   A             B                First_val   Second_value  Colour
0  aaa  hello                     bla          blue       Red
1  aaa  world                     bla          nan        Red
...

How can I get it?

Upvotes: 1

Views: 288

Answers (1)

jezrael
jezrael

Reputation: 862511

I think you need add not lists column Colour to MultiIndex so not processing by converting lists to Series in DataFrame.set_index, also added DataFrame.reindex for original order of columns:

df = (df.set_index(['A','Colour'])
        .apply(lambda x: x.apply(pd.Series).stack())
        .reset_index()
        .reindex(df.columns, axis=1)
              )
print (df)
     A      B First_val Second_value  Colour
0  aaa  hello       bla         blue     Red
1  aaa  world       bla          NaN     Red
2  sad    aaa       cat           hi   Green
3  vas    min       avg          NaN  Yellow
4  vas    max       NaN          NaN  Yellow

Another I hope faster alternative with convert values of lists to DataFrames:

df = (df.set_index(['A','Colour'])
        .apply(lambda x: pd.DataFrame(x.tolist(), index=x.index).stack())
        .reset_index()
        .reindex(df.columns, axis=1)
              )
print (df)
     A      B First_val Second_value  Colour
0  aaa  hello       bla         blue     Red
1  aaa  world       bla          NaN     Red
2  sad    aaa       cat           hi   Green
3  vas    min       avg          NaN  Yellow
4  vas    max       NaN          NaN  Yellow

Upvotes: 3

Related Questions