Reputation: 61
I have a DataFrame with shape (12000, 21) that looks like this:
id  CID    U_lot  P4  P5  P6  P7  P8  P9
0   A0694  M      0   1   0   1   1   0
1   A1486  M      0   0   1   0   0   0
2   C0973  S      0   1   1   0   0   0
3   B4251  D      0   0   0   1   0   1
4   I0041  S      1   0   0   1   1   0
5   J1102  F      0   0   0   0   0   1
How do I transform the DataFrame to look like this:
id  CID    U_lot  P_lots   Label
0   A0694  M      [P5,P7]  P8
1   A0694  M      [P5,P8]  P7
2   A0694  M      [P7,P8]  P5
3   A1486  M      NaN      P6
4   C0973  S      [P5]     P6
5   C0973  S      [P6]     P5
6   B4251  D      [P7]     P9
7   B4251  D      [P9]     P7
8   I0041  S      [P4,P7]  P8
9   I0041  S      [P4,P8]  P7
10  I0041  S      [P7,P8]  P4
11  J1102  F      NaN      P9
I have tried reversing pd.get_dummies, but it doesn't seem to work.
Upvotes: 1
Views: 61
Reputation: 59579
Getting the list column really kills the efficiency. But if it's necessary, first stack (or melt) the DataFrame into a long format. At this point also keep track of all of the rows we will need in the final output (necessary to get those NaN rows later).
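# Long format: one row per (id, CID, U_lot, column) where the flag is non-zero
df1 = (df.set_index(['id', 'CID', 'U_lot'])
         .stack()
         .loc[lambda x: x != 0]
         .reset_index(-1)
         .drop(columns=0)  # the flag values themselves are no longer needed
         .rename(columns={'level_3': 'Label'}))
# Every (id, CID, U_lot, Label) row that must appear in the final output
idx = df1.set_index('Label', append=True).index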
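Since melt is mentioned as an alternative to stack, here is a minimal sketch of what that variant might look like. The sample frame is copied from the question (only the columns shown there), and the names `keys`, `long_df` and `flag` are just illustrative choices. The stable sort on id is there to keep the labels in column order within each row, on the assumption that melt emits rows column by column.

```python
import pandas as pd

# Sample data copied from the question (only the columns shown there).
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "CID": ["A0694", "A1486", "C0973", "B4251", "I0041", "J1102"],
    "U_lot": ["M", "M", "S", "D", "S", "F"],
    "P4": [0, 0, 0, 0, 1, 0],
    "P5": [1, 0, 1, 0, 0, 0],
    "P6": [0, 1, 1, 0, 0, 0],
    "P7": [1, 0, 0, 1, 1, 0],
    "P8": [1, 0, 0, 0, 1, 0],
    "P9": [0, 0, 0, 1, 0, 1],
})

keys = ["id", "CID", "U_lot"]

# melt-based equivalent of the stack step: keep only the non-zero flags
long_df = (
    df.melt(id_vars=keys, var_name="Label", value_name="flag")
      .query("flag != 0")
      .drop(columns="flag")
      .sort_values("id", kind="stable")  # regroup by id, keep column order
      .set_index(keys)
)

# Same bookkeeping as above: all rows needed in the final output
idx = long_df.set_index("Label", append=True).index
```

`long_df` should have the same layout as `df1` (the three key columns in the index, a single 'Label' column), so the merge and groupby steps can be applied to it unchanged.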
Then we will merge that long DataFrame with itself so we can get all of the 'P_lots', excluding the label that is split out, with a query.
df1 = (df1.merge(df1, left_index=True, right_index=True, suffixes=['', '_r'])
          .query('Label != Label_r'))
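To see what the self-merge does, it may help to run it on a tiny frame. This is only a toy sketch: the frame below is made up for illustration (a single 'id' index level instead of three), and `long_df`, `pairs` and `n_pairs_before` are hypothetical names.

```python
import pandas as pd

# Toy long-format frame: id 0 has three labels, id 1 has only one.
long_df = pd.DataFrame(
    {"Label": ["P5", "P7", "P8", "P6"]},
    index=pd.Index([0, 0, 0, 1], name="id"),
)

# Merging on the (duplicated) index pairs every label with every label
# that shares the same id, including itself.
pairs = long_df.merge(long_df, left_index=True, right_index=True,
                      suffixes=("", "_r"))
n_pairs_before = len(pairs)  # 3*3 pairs for id 0 plus 1*1 for id 1

# Dropping the self-pairs leaves, for each label, the other labels of that id.
pairs = pairs.query("Label != Label_r")
```

Note that id 1 vanishes entirely after the query, because its only pair was the label with itself. That is the reason for saving `idx` beforehand and reindexing at the end.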
Finally, groupby to get the list and reindex to get back the NaN rows.
df1 = (df1.groupby(['id', 'CID', 'U_lot', 'Label'])
          .agg(P_lot=('Label_r', list))
          .reindex(idx)
          .reset_index())
    id  CID    U_lot  Label  P_lot
0   0   A0694  M      P5     [P7, P8]
1   0   A0694  M      P7     [P5, P8]
2   0   A0694  M      P8     [P5, P7]
3   1   A1486  M      P6     NaN
4   2   C0973  S      P5     [P6]
5   2   C0973  S      P6     [P5]
6   3   B4251  D      P7     [P9]
7   3   B4251  D      P9     [P7]
8   4   I0041  S      P4     [P7, P8]
9   4   I0041  S      P7     [P4, P8]
10  4   I0041  S      P8     [P4, P7]
11  5   J1102  F      P9     NaN
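As a cross-check on small inputs, the same leave-one-out result can also be built with a plain Python loop. This is not the approach above, just a rough sketch: `leave_one_out` is a hypothetical helper, it assumes every non-key column is a 0/1 flag, and I have not benchmarked it against the merge version.

```python
import numpy as np
import pandas as pd

def leave_one_out(df, keys=("id", "CID", "U_lot")):
    """Hypothetical helper: one output row per flagged column, listing the others."""
    keys = list(keys)
    # Assumption: every column that is not a key is a 0/1 flag column.
    flag_cols = [c for c in df.columns if c not in keys]
    rows = []
    for rec in df.to_dict("records"):
        active = [c for c in flag_cols if rec[c] != 0]
        for label in active:
            others = [c for c in active if c != label]
            row = {k: rec[k] for k in keys}
            row["Label"] = label
            # A lone label has no companions, so it gets NaN instead of a list.
            row["P_lot"] = others if others else np.nan
            rows.append(row)
    return pd.DataFrame(rows, columns=keys + ["Label", "P_lot"])

# First two rows of the question's sample, restricted to four flag columns.
sample = pd.DataFrame({
    "id": [0, 1],
    "CID": ["A0694", "A1486"],
    "U_lot": ["M", "M"],
    "P5": [1, 0], "P6": [0, 1], "P7": [1, 0], "P8": [1, 0],
})
out = leave_one_out(sample)
```

Here `out` should have three rows for A0694 (one per flagged column) and a single row for A1486 with NaN in 'P_lot', matching the first four rows of the output above.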
Upvotes: 3