Reputation: 650
In Python 3, I have a starting DataFrame of multilabel binary data:
df1:
"a" "b" "c" "d" "e"
1 1 0 0 1
0 0 1 0 1
1 0 0 0 0
0 1 1 0 1
What I need to achieve is this:
df2:
"a" "b" "c" "d" "e" "labels"
1 1 0 0 1 ["a", "b", "e"]
0 0 1 0 1 ["c", "e"]
1 0 0 0 0 ["a"]
0 1 1 0 1 ["b", "c", "e"]
To start, I tried the inverse_transform() method of sklearn's MultiLabelBinarizer, based on this previous Stack Overflow question.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(df1.columns)
mlb.inverse_transform(df1.values)
ValueError: Expected indicator for 15 classes, but got 5
I followed the sklearn documentation as closely as I could and tried tweaking a few of the parameters, but I do not understand where I went wrong.
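One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: `fit` expects an iterable of label collections, so each column name passed on its own is treated as a collection of characters, and the fitted classes end up being letters rather than column names. The plain-Python snippet below only mimics that collection step. The column names are hypothetical, since the real ones are not shown in the question.

```python
from itertools import chain

# Hypothetical column names; the real ones are not shown in the question.
columns = ["cat", "dog", "bird"]

# Each string is iterated character by character, so the "classes" are letters.
per_character = sorted(set(chain.from_iterable(columns)))

# Wrapping the names in one outer list makes them a single sample of labels.
per_column = sorted(set(chain.from_iterable([columns])))

print(len(per_character), per_character)
print(len(per_column), per_column)
```

Under that reading, fitting on a single sample such as `mlb.fit([df1.columns])` should give one class per column. Note that the classes are sorted during fitting, so this assumes the columns are already in sorted order; passing `classes=df1.columns` to the constructor is one way to pin the order instead.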
Upvotes: 2
Views: 882
Reputation: 26686
df2 = df.apply(lambda x: x > 0)  # build a boolean DataFrame
l = df.columns.to_numpy()  # put the column names into a NumPy array
# Compute the `labels` column with a list comprehension: each boolean row masks the column names.
df['labels'] = [list(l[i]) for i in df2.to_numpy()]
Upvotes: 0
Reputation: 294536
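import numpy as np

i, j = np.where(df)  # row and column positions of the nonzero cells
a = df.columns.to_numpy()[j]  # column name for each nonzero cell
b = np.flatnonzero(np.diff(i)) + 1  # offsets where the row index changes
df.assign(labels=np.split(a, b))  # split the names at those offsets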
a b c d e labels
0 1 1 0 0 1 [a, b, e]
1 0 0 1 0 1 [c, e]
2 1 0 0 0 0 [a]
3 0 1 1 0 1 [b, c, e]
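Since the snippet above comes without commentary, here is a self-contained sketch of the same split-by-row idea. It makes one substitution that is mine, not the answer's: the split offsets come from `np.searchsorted` rather than `np.diff`, on the assumption that rows with no 1s should get an empty list instead of being skipped. The helper name `labels_by_split` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 0, 0, 1], [0, 0, 1, 0, 1], [1, 0, 0, 0, 0], [0, 1, 1, 0, 1]],
    columns=list("abcde"),
)

def labels_by_split(frame):
    """Hypothetical helper: one list of column names per row."""
    # Row and column positions of the nonzero cells, in row-major order,
    # so `rows` is non-decreasing.
    rows, cols = np.where(frame.to_numpy())
    names = frame.columns.to_numpy()[cols]
    # Offset at which each row after the first starts. Unlike np.diff,
    # this also leaves an empty slot for a row with no nonzero cell.
    starts = np.searchsorted(rows, np.arange(1, len(frame)))
    return [list(part) for part in np.split(names, starts)]

labels = labels_by_split(df)

# Same frame with row 2 zeroed out, to exercise the empty-row case.
with_empty = df.copy()
with_empty.iloc[2] = 0
labels_with_empty = labels_by_split(with_empty)
```

With the `np.diff` offsets, an all-zero row produces no boundary, so the number of pieces would no longer match the number of rows. If every row is guaranteed at least one label, the two approaches should agree.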
Upvotes: 3
Reputation: 323386
Let us try dot with str.split:
df['labels'] = df.dot(df.columns+',').str[:-1].str.split(',')
0 ["a", "b", "e"]
1 ["c", "e"]
2 ["a"]
3 ["b", "c", "e"]
dtype: object
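For anyone wondering why this works: as far as I can tell, the one-liner leans on Python's `int * str` repetition, so each row's dot product concatenates the names of the columns holding a 1. Below is a plain-Python sketch of what that computes for a single row. The helper `row_labels` is hypothetical and only mirrors the idea; it is not how pandas implements `dot`.

```python
def row_labels(row, columns):
    """Hypothetical helper mirroring the dot-product one-liner for one row."""
    # int * str repeats the string: 1 * "a," == "a,", 0 * "a," == "".
    joined = "".join(value * (name + ",") for value, name in zip(row, columns))
    # Drop the trailing comma, then split back into names.
    return joined[:-1].split(",")

columns = ["a", "b", "c", "d", "e"]
first = row_labels([1, 1, 0, 0, 1], columns)     # a row from the question
all_zero = row_labels([0, 0, 0, 0, 0], columns)  # no labels set
with_two = row_labels([2, 0, 0, 0, 0], columns)  # a value other than 0/1
```

Two things follow from this sketch: an all-zero row gives a list holding one empty string rather than an empty list, and a value above 1 repeats the name. Column names that themselves contain a comma would also be split apart, so the approach assumes strictly 0/1 data and comma-free names.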
Upvotes: 3
Reputation: 150815
You can stack the data, filter the values, and groupby:
df['labels'] = (df.stack()
                  .loc[lambda x: x>0]
                  .reset_index()
                  .groupby('level_0')
                  .agg({'level_1':list})
               )
Output:
"a" "b" "c" "d" "e" labels
0 1 1 0 0 1 ["a", "b", "e"]
1 0 0 1 0 1 ["c", "e"]
2 1 0 0 0 0 ["a"]
3 0 1 1 0 1 ["b", "c", "e"]
Upvotes: 2