Reputation: 51
I am trying to stack this table based on ID column but only considering columns [A-D] where the value is 1 and not 0.
Current df:
ID | A | B | C | D |
---|---|---|---|---|
1 | 1 | 0 | 0 | 1 |
3 | 0 | 1 | 0 | 1 |
7 | 1 | 0 | 1 | 1 |
8 | 1 | 0 | 0 | 0 |
What I want:
ID | LETTER |
---|---|
1 | A |
1 | D |
3 | B |
3 | D |
7 | A |
7 | C |
7 | D |
8 | A |
The following code works but I need a more efficient solution as I have a df with 93434 rows x 12377 columns.
stacked_df = df.set_index('ID').stack().reset_index(name='has_letter').rename(columns={'level_1':'LETTER'})
stacked_df = stacked_df[stacked_df['has_letter']==1].reset_index(drop=True)
stacked_df.drop(['has_letter'], axis=1, inplace=True)
Upvotes: 1
Views: 274
Reputation: 23499
If the column labels are just letters, then you can use .dot()
product to multiply the column labels with the dummy values and expand the resulting strings to for the desired result.
(
df.set_index('ID')
.dot(df.columns[1:]) # use inner product of column names and values
.apply(list) # separate each letter
.explode() # explode each list
.reset_index(name='LETTER') # reset index for df
)
However, if the column labels are words (e.g. instead of "A"
, it is "col1"
), then try the following approach where instead of .dot()
, .mul()
is used to convert the column labels into column values.
(
df
.set_index('ID')
.mul(df.columns[1:])
.stack()
.loc[lambda x: x!=""]
.droplevel(level=1)
.reset_index(name="LETTERS")
)
Given the input, both methods result in the expected output:
ID LETTER
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A
Upvotes: 2
Reputation: 195643
Try:
print(
df.set_index("ID")
.apply(lambda x: x.index[x == 1], axis=1)
.reset_index()
.explode(0)
.rename(columns={0: "LETTERS"})
)
Prints:
ID LETTERS
0 1 A
0 1 D
1 3 B
1 3 D
2 7 A
2 7 C
2 7 D
3 8 A
Or:
x = df.set_index("ID").stack()
print(
x[x == 1]
.reset_index()
.drop(columns=0)
.rename(columns={"level_1": "LETTER"})
)
Prints:
ID LETTER
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A
Upvotes: 2
Reputation: 262634
You can mask the non-1 values and stack
to remove the NaNs:
df2 = df.rename_axis(columns='LETTERS').set_index('ID')
stacked_df = (df2.where(df2.eq(1)).stack()
.reset_index().iloc[:,:2]
)
Output:
ID LETTERS
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A
Upvotes: 2