Reputation: 129
I have a quite big dataframe containing 3 columns, each of which contains a list. Each list can be arbitrarily long (and is, sometimes, quite long).
a Yes No
[1, 2, 3] ["a", "b"] ["A", "B", "C"]
[7, 11, 6] ["a", "d", "f"] ["C", "H", "L", "Z"]
I want to create a new dataframe which looks like this:
C1 C2 value
1 a 1
1 b 1
2 a 1
2 b 1
...
3 b 1
1 A 0
1 B 0
1 C 0
...
3 C 0
...
7 a 1
7 d 1
...
6 Z 0
The order of the rows does not matter. I'm looking for an efficient way of doing that.
What I'm doing at the moment is the following (I might be complicating things a bit):
new_df = pd.DataFrame(columns=["C1", "C2", "value"])
def compute_pairs(df1, new_df):
perm = list(itertools.product(df1.a, df1.Yes))
for p in perm:
new_df.loc[len(new_df)] = list(p) + [1]
perm = list(itertools.product(df1.a, df1.No))
for p in perm:
df.loc[len(new_df)] = list(p) + [0]
df1.apply(lambda x: compute_pairs(x, new_df), axis=1)
However, the for look is quite slow. I tried to use map but I failed.
Any improvements?
Upvotes: 0
Views: 41
Reputation: 150825
Let's try melt
with explode
:
(df.melt('a',var_name='value',value_name='C')
.explode('C').explode('a')
)
Output:
a value C
0 1 Yes a
0 2 Yes a
0 3 Yes a
0 1 Yes b
0 2 Yes b
0 3 Yes b
1 7 Yes a
1 11 Yes a
1 6 Yes a
1 7 Yes d
1 11 Yes d
1 6 Yes d
1 7 Yes f
1 11 Yes f
1 6 Yes f
2 1 No A
2 2 No A
2 3 No A
2 1 No B
2 2 No B
2 3 No B
2 1 No C
2 2 No C
2 3 No C
3 7 No C
3 11 No C
3 6 No C
3 7 No H
3 11 No H
3 6 No H
3 7 No L
3 11 No L
3 6 No L
3 7 No Z
3 11 No Z
3 6 No Z
Upvotes: 1