Reputation: 2458
Say this is my DataFrame:
+-------+-----+----------------------+--------------------------+
| Month | Day | Jane | Bob |
+=======+=====+======================+==========================+
| Mar | 10 | ['Orange', 'Apples'] | ['Banana', 'Apples'] |
+-------+-----+----------------------+--------------------------+
| Mar | 11 | ['Pear', 'Grape'] | ['Strawberry', 'Cherry'] |
+-------+-----+----------------------+--------------------------+
Ultimately, I want to get to this (watch for "Apple" which should appear only once in the output):
+--------+-----+------------+------+-----+
| Month | Day | Fruit | Jane | Bob |
+========+=====+============+======+=====+
| Mar | 10 | Orange | 1 | 0 |
+--------+-----+------------+------+-----+
| Mar | 10 | Apple | 1 | 1 |
+--------+-----+------------+------+-----+
| Mar | 10 | Banana | 0 | 1 |
+--------+-----+------------+------+-----+
| Mar | 11 | Pear | 1 | 0 |
+--------+-----+------------+------+-----+
| Mar | 11 | Grape | 1 | 0 |
+--------+-----+------------+------+-----+
| Mar | 11 | Strawberry | 0 | 1 |
+--------+-----+------------+------+-----+
| Mar | 11 | Cherry | 0 | 1 |
+--------+-----+------------+------+-----+
So kind of trying to do a get_dummies
that expands into rows. Not quite sure how to tackle this. I tried starting with explode
for each column, but in my real-life case with many more columns and values, this causes the number of rows to blow out as well, with each "Fruit" being repeated for each person (unlike the above case with "Apple").
Any suggestion?
Upvotes: 1
Views: 491
Reputation: 71689
We can stack
the dataframe to reshape then explode
the reshaped frame followed by groupby
on Month
, Day
and Fruit
then agg level_2
with join
and encode using get_dummies
c = ['Month', 'Day']
out = df.set_index(c).stack().explode().reset_index(name='Fruit')
out.groupby([*c, 'Fruit'], sort=False)['level_2'].agg(','.join).str.get_dummies(sep=',')
Bob Jane
Month Day Fruit
Mar 10 Orange 0 1
Apples 1 1
Banana 1 0
11 Pear 0 1
Grape 0 1
Strawberry 1 0
Cherry 1 0
Upvotes: 2
Reputation: 195438
df["Fruit"] = df.loc[:, "Jane":].apply(
lambda x: set(i for l in x for i in l), axis=1
)
df = df.explode("Fruit")
for c in df.loc[:, "Jane":"Bob"]:
df[c] = df.apply(lambda x: int(x["Fruit"] in x[c]), axis=1)
print(df)
Prints:
Month Day Jane Bob Fruit
0 Mar 10 1 0 Orange
0 Mar 10 1 1 Apples
0 Mar 10 0 1 Banana
1 Mar 11 0 1 Cherry
1 Mar 11 0 1 Strawberry
1 Mar 11 1 0 Pear
1 Mar 11 1 0 Grape
Upvotes: 0