fati
fati

Reputation: 79

How to binary encode multi-valued categorical variable in pandas?

I have the following dataframe with multiple values for a certain column:

             my column

         0 - ["A", "B"]
         1 - ["B", "C", "D"]
         2 - ["B", "D"]

How Can I get a dataframe like this : (where every column takes the name of the values in "my column")

         "A"  "B"  "C"  "D"
      0 - 1    1    0    0
      1 - 0    1    1    1
      2 - 0    1    0    1

Upvotes: 2

Views: 1744

Answers (4)

Rahul Singh
Rahul Singh

Reputation: 184

You can use CountVectorizer, It is specialy designed for this purpose. It takes corpus of text and do One-Hot Encoding for it.

Note : I am using 'Cat', 'Dog', 'Cow', 'Tiger' Instead of 'A', 'B', 'C', 'D'

Code :

Imports :

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Method to convert elemrnts of list to string :

def get_string(listt):
  return ' '.join(listt)

Creating DataFrame from List :

my_column = pd.Series([['Cat','Dog'],['Dog','Cow','Tiger'],['Dog','Tiger']])
df = pd.DataFrame(my_column, columns=['my_column'])
print(df)
df['text_data'] = df.my_column.apply(get_string)
print(df)

Performing Text Vectorization : tf_vectorizer = CountVectorizer( stop_words=None) vectorized_data = tf_vectorizer.fit_transform(df.text_data)

Preparing final DataFrame :

final_df = pd.DataFrame(vectorized_data.toarray(),columns=tf_vectorizer.get_feature_names())
print(final_df)

Out Put :

Our DataFrame :

           my_column
0         [Cat, Dog]
1  [Dog, Cow, Tiger]
2       [Dog, Tiger]

DataFrame with text column:

           my_column      text_data
0         [Cat, Dog]        Cat Dog
1  [Dog, Cow, Tiger]  Dog Cow Tiger
2       [Dog, Tiger]      Dog Tiger

Expected Result :

   cat  cow  dog  tiger
0    1    0    1      0
1    0    1    1      1
2    0    0    1      1

Upvotes: 1

neutrino_logic
neutrino_logic

Reputation: 1299

Just for fun here's a naive implementation of dummies:

import pandas as pd

my_column = pd.Series([['A','B'],['B','C','D'],['B','D']])

frameA = pd.DataFrame(my_column, columns=['my_column'])
#extract all new headers from the DataFrame rows, in order:
headers = sorted(list(set([x for y in frame['my_column'] for x in y])))
#make a list of the DataFrame rows (stored as lists):
rows = [y for x in range(len(frame)) for y in frame.loc[x]]

builder = {}               #construct a dictionary to build a new DataFrame from
for header in headers:
    column = []
    for row in rows:
        if header in row:
            column.append(1)
        else:
            column.append(0)
    builder.update({header:column})

frameB = pd.DataFrame(builder)

print(frameB)

results in:

   A  B  C  D
0  1  1  0  0 
1  0  1  1  1
2  0  1  0  1

Upvotes: 1

jezrael
jezrael

Reputation: 862761

If there are lists in column use Series.str.join with Series.str.get_dummies:

df = df['my column'].str.join('|').str.get_dummies()
print (df)
   A  B  C  D
0  1  1  0  0
1  0  1  1  1
2  0  1  0  1

Or MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['my column']),columns=mlb.classes_)
print (df)
   A  B  C  D
0  1  1  0  0
1  0  1  1  1
2  0  1  0  1

If there are strings use Series.str.strip with str.get_dummies and last if necessary remove " from columns names:

df = (df['my column'].str.strip('[]')
                     .str.get_dummies(', ')
                     .rename(columns=lambda x: x.strip('"')))
print (df)
   A  B  C  D
0  1  1  0  0
1  0  1  1  1
2  0  1  0  1

Upvotes: 3

Max Feinberg
Max Feinberg

Reputation: 840

I think what you're looking for is theget_dummies() function in pandas which you can find the documentation for here

From the documentation:

s = pd.Series(list('abca'))
pd.get_dummies(s)

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0

Upvotes: 0

Related Questions