Pythoner

Reputation: 650

Reversing a MultiLabelBinarizer to create a list within a column

In Python 3, I have a starting DataFrame of multilabel binary (indicator) data:

df1:

"a" "b" "c" "d" "e"

 1   1   0   0   1
 0   0   1   0   1
 1   0   0   0   0
 0   1   1   0   1

What I need to achieve is this:

df2:

"a" "b" "c" "d" "e" "labels"

 1   1   0   0   1   ["a", "b", "e"]
 0   0   1   0   1   ["c", "e"]
 1   0   0   0   0   ["a"]
 0   1   1   0   1   ["b", "c", "e"]

To start, I tried the inverse_transform() method of sklearn's MultiLabelBinarizer, based on this previous Stack Overflow question.

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(df1.columns)
mlb.inverse_transform(df1.values)

ValueError: Expected indicator for 15 classes, but got 5

I tried following the exact documentation from sklearn, but I am not sure where I went wrong. I tried tweaking a few of the parameters, but I do not understand what the issue is.
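For reference, a minimal sketch of what the error likely points to: `fit()` expects an iterable of label collections, so passing `df1.columns` directly makes each column name get iterated character by character, and every distinct character becomes a class. Wrapping the columns in a list treats them as one sample holding all labels. The sample frame below uses the single-letter names from the example; note that `classes_` is sorted, so this assumes the columns are already in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample indicator frame from the question (column names assumed to be a..e).
df1 = pd.DataFrame(
    [[1, 1, 0, 0, 1], [0, 0, 1, 0, 1], [1, 0, 0, 0, 0], [0, 1, 1, 0, 1]],
    columns=list("abcde"),
)

mlb = MultiLabelBinarizer()
# One "sample" containing every label, so each column name is a single class.
mlb.fit([df1.columns])

# inverse_transform returns one tuple of labels per row; convert to lists.
labels = [list(row) for row in mlb.inverse_transform(df1.to_numpy())]
df2 = df1.assign(labels=labels)
```

If the columns are not in sorted order, passing `classes=list(df1.columns)` to the constructor should keep the class order aligned with the column order.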

Upvotes: 2

Views: 882

Answers (4)

wwnde

Reputation: 26686

df2 = df.apply(lambda x: x > 0)  # build a boolean DataFrame

l = df.columns.to_numpy()  # put the column names into a NumPy array

# Build the `labels` column by indexing the names with each boolean row.
df['labels'] = [l[i].tolist() for i in df2.to_numpy()]


Upvotes: 0

piRSquared

Reputation: 294536

A Numpy approach

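import numpy as np

i, j = np.where(df)                 # row and column positions of the nonzero cells
a = df.columns.to_numpy()[j]        # column name for each nonzero cell
b = np.flatnonzero(np.diff(i)) + 1  # positions where the row index changes
df.assign(labels=np.split(a, b))    # one chunk of names per row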

   a  b  c  d  e     labels
0  1  1  0  0  1  [a, b, e]
1  0  0  1  0  1     [c, e]
2  1  0  0  0  0        [a]
3  0  1  1  0  1  [b, c, e]
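
Splitting where the row index changes assumes every row has at least one nonzero entry; a row of all zeros contributes nothing to `i`, so `np.split` would return fewer chunks than there are rows. A hedged variant, as a sketch only, splits at the cumulative per-row counts instead, which yields an empty list for such rows. The helper name `labels_from_indicator` is made up for illustration.

```python
import numpy as np
import pandas as pd

def labels_from_indicator(frame):
    """Return one list of column names per row of a 0/1 indicator frame."""
    mask = frame.to_numpy() != 0
    # np.nonzero walks the array row by row, so names come out grouped by row.
    _, col_idx = np.nonzero(mask)
    names = frame.columns.to_numpy()[col_idx]
    # Split after each row's count of nonzero cells; empty rows give empty chunks.
    boundaries = np.cumsum(mask.sum(axis=1))[:-1]
    return [chunk.tolist() for chunk in np.split(names, boundaries)]

df = pd.DataFrame(
    [[1, 1, 0, 0, 1], [0, 0, 1, 0, 1], [1, 0, 0, 0, 0], [0, 1, 1, 0, 1]],
    columns=list("abcde"),
)
labels = labels_from_indicator(df)
result = df.assign(labels=labels)
```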

Upvotes: 3

BENY

Reputation: 323386

Let us try dot with str.split

df['labels'] = df.dot(df.columns+',').str[:-1].str.split(',')
0    ["a", "b", "e"]
1         ["c", "e"]
2              ["a"]
3    ["b", "c", "e"]
dtype: object

Upvotes: 3

Quang Hoang

Reputation: 150815

You can stack data, filter the values, and groupby:

df['labels'] = (df.stack()
   .loc[lambda x: x>0]
   .reset_index()
   .groupby('level_0')
   .agg({'level_1':list})
)

Output:

   "a"  "b"  "c"  "d"  "e"           labels
0    1    1    0    0    1  ["a", "b", "e"]
1    0    0    1    0    1       ["c", "e"]
2    1    0    0    0    0            ["a"]
3    0    1    1    0    1  ["b", "c", "e"]

Upvotes: 2
