Reputation: 251
My dataset has multiple columns which include lists.
A B First_val Second_value Colour
0 aaa ['hello', 'world'] ['bla','bla'] ['blue'] Red
1 sad ['aaa'] ['cat'] ['hi'] Green
2 vas ['min','max'] ['avg'] [] Yellow
...
The colour should be assigned only based on A value: this means that, if I have multiple rows because of exploding lists, I should have red all the times, then green, then yellow for all the rows having in A aaa, sad and vas respectively.
I have tried with
(df.set_index('A')
.apply(lambda x: x.apply(pd.Series).stack())
.reset_index()
.drop('level_1', 1))
but the value of Colour is not assigned per all A's values, but only for the first one (the others are NaN). Example (wrong):
A B First_val Second_value Colour
0 aaa hello bla blue Red
1 aaa world bla nan nan
...
Expected output:
A B First_val Second_value Colour
0 aaa hello bla blue Red
1 aaa world bla nan Red
...
How can I get it?
Upvotes: 1
Views: 288
Reputation: 862511
I think you need add not lists column Colour
to MultiIndex
so not processing by converting lists to Series in DataFrame.set_index
, also added DataFrame.reindex
for original order of columns:
df = (df.set_index(['A','Colour'])
.apply(lambda x: x.apply(pd.Series).stack())
.reset_index()
.reindex(df.columns, axis=1)
)
print (df)
A B First_val Second_value Colour
0 aaa hello bla blue Red
1 aaa world bla NaN Red
2 sad aaa cat hi Green
3 vas min avg NaN Yellow
4 vas max NaN NaN Yellow
Another I hope faster alternative with convert values of lists to DataFrame
s:
df = (df.set_index(['A','Colour'])
.apply(lambda x: pd.DataFrame(x.tolist(), index=x.index).stack())
.reset_index()
.reindex(df.columns, axis=1)
)
print (df)
A B First_val Second_value Colour
0 aaa hello bla blue Red
1 aaa world bla NaN Red
2 sad aaa cat hi Green
3 vas min avg NaN Yellow
4 vas max NaN NaN Yellow
Upvotes: 3