Zafarullah Mahmood

Reputation: 306

Pandas: How to prepare a Multi-Label Dataset?

I have a function get_tags that returns a list of labels for a given text:

def get_tags(text):
    # Do some analysis and return a list of tags
    return tags

For example, get_tags(text1) returns ['a', 'b', 'c'], while get_tags(text2) returns ['a', 'b'].

I also have a pandas DataFrame df with columns [text, a, b, c, d, e, f] and 500,000 rows. For each row, I want to set the columns matching that row's tags to 1. Right now, I am executing

for i in range(len(df)):
    df.loc[i, get_tags(df.loc[i, "text"])] = 1

This is painfully slow. I could parallelize it with joblib, but first I want to know the most efficient way to achieve this.

Before execution, df looks like this:

                       text  a  b  c  d  e  f
0  text having a, b, c tags  0  0  0  0  0  0
1     text having a, c tags  0  0  0  0  0  0
2  text having a, b, f tags  0  0  0  0  0  0

After the execution, it should look like this:

                       text  a  b  c  d  e  f
0  text having a, b, c tags  1  1  1  0  0  0
1     text having a, c tags  1  0  1  0  0  0
2  text having a, b, f tags  1  1  0  0  0  1

Upvotes: 5

Views: 4497

Answers (2)

Uzii

Reputation: 31

Assuming df is your raw DataFrame, you can also use MultiLabelBinarizer from sklearn.preprocessing.

Before execution, df is:

--------------------------
|   | text | labels       | 
--------------------------
| 0 |  A   | a, b, c      |
--------------------------
| 1 |  B   | a, c         |
--------------------------
| 2 |  C   | a, b, f      |
--------------------------

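Do as follows:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
# Fix the class list so that d and e get columns even though no row has them
mlb = MultiLabelBinarizer(classes=['a', 'b', 'c', 'd', 'e', 'f'])
# Strip whitespace after splitting: 'a, b, c' -> ['a', 'b', 'c']
labels = [[t.strip() for t in str(s).split(',')] for s in df['labels']]
mlb_result = mlb.fit_transform(labels)
df_final = pd.concat([df['text'], pd.DataFrame(mlb_result, columns=mlb.classes_, index=df.index)], axis=1)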

After execution, df_final is:

------------------------------------
|   | text | a | b | c | d | e | f | 
------------------------------------
| 0 |  A   | 1 | 1 | 1 | 0 | 0 | 0 | 
------------------------------------
| 1 |  B   | 1 | 0 | 1 | 0 | 0 | 0 | 
------------------------------------
| 2 |  C   | 1 | 1 | 0 | 0 | 0 | 1 | 
------------------------------------

df_final is the result you want.
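This answer reads the tags from a labels column, whereas the question produces them with get_tags. A minimal sketch of that variant follows. The get_tags body here is only a hypothetical stand-in for the asker's function, and the names LABELS and indicators are introduced for illustration. Passing classes=LABELS is an assumption that the full label set is known up front; it keeps columns such as d and e even when no row carries them.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

LABELS = ["a", "b", "c", "d", "e", "f"]  # assumed full label set


def get_tags(text):
    # Hypothetical stand-in for the asker's function: keeps the tokens of the
    # text that are known labels.
    tokens = text.replace(",", " ").split()
    return [t for t in tokens if t in LABELS]


df = pd.DataFrame({"text": ["text having a, b, c tags",
                            "text having a, c tags",
                            "text having a, b, f tags"]})

# Call get_tags once per row, then binarize all rows in a single pass
# instead of writing to df.loc row by row.
tags = df["text"].map(get_tags)
mlb = MultiLabelBinarizer(classes=LABELS)
indicators = pd.DataFrame(mlb.fit_transform(tags),
                          columns=mlb.classes_, index=df.index)
df_final = pd.concat([df["text"], indicators], axis=1)
print(df_final)
```

With 500,000 rows the remaining cost is likely dominated by get_tags itself, so that call is the part worth parallelizing if it is still too slow.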

Upvotes: 3

FilippoBu

Reputation: 100

I'm not sure whether this speeds things up, but it should work:

tags = df['text'].apply(get_tags)  # call get_tags once per row, not once per column
for col in ['a','b','c','d','e','f']:
    df[col] = [(1 if col in element else 0) for element in tags]
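A pandas-only variant of the same idea can be sketched with Series.str.join and Series.str.get_dummies, which build all indicator columns at once. This is an untimed sketch, not a benchmarked solution. The get_tags body is a hypothetical stand-in for the asker's function, LABELS, dummies and result are illustrative names, and the approach assumes that no tag contains the "|" separator.

```python
import pandas as pd

LABELS = ["a", "b", "c", "d", "e", "f"]  # assumed full label set


def get_tags(text):
    # Hypothetical stand-in for the asker's function.
    tokens = text.replace(",", " ").split()
    return [t for t in tokens if t in LABELS]


df = pd.DataFrame({"text": ["text having a, b, c tags",
                            "text having a, c tags",
                            "text having a, b, f tags"]})

# Compute the tag lists once, one call per row.
tags = df["text"].apply(get_tags)

# Join each list into a string like "a|b|c", then expand to 0/1 columns.
dummies = tags.str.join("|").str.get_dummies(sep="|")

# Add columns for labels that never occur, in a fixed order.
dummies = dummies.reindex(columns=LABELS, fill_value=0)

result = df[["text"]].join(dummies)
print(result)
```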

Upvotes: 0
