E. Camus

Reputation: 149

Creating new boolean pandas columns based on presence in a list

I have a pandas DataFrame that consists of standard numerical columns and an additional column (char) whose cells each contain a list of values.

Instead of keeping these encoded as lists, I would like to create one boolean column for each unique value found across all the lists, indicating whether the row's list contains that value.

Input - DataFrame

        char  id
1  [a, b, c]   2
2     [a, c]   3
5  [c, a, d]   6
7         []   8

Desired output - each potential list value gets its own column, which is True when that value is present in the row's char list. It's important that an empty list yields False in all the created columns.

   id  char a  char b  char c  char d
1   2    True    True    True   False
2   3    True   False    True   False
5   6    True   False    True    True
7   8   False   False   False   False

What is an efficient way to do this that scales to any number of potential values in the list column (char in this instance)?

Upvotes: 0

Views: 803

Answers (1)

BENY

Reputation: 323226

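With pandas 0.25 or later, you can use explode:

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
s = df['char'].explode().str.get_dummies().groupby(level=0).sum().astype(bool)
df = df.join(s)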
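
For reference, here is a minimal end-to-end sketch of the explode approach on the question's sample data. The `add_prefix("char ")` step and the names `exploded`, `flags`, and `result` are additions for illustration, assumed only to match the column names in the desired output; they are not part of the one-liner above.

```python
import pandas as pd

# Sample frame mirroring the question's input (index labels 1, 2, 5, 7).
df = pd.DataFrame(
    {
        "char": [["a", "b", "c"], ["a", "c"], ["c", "a", "d"], []],
        "id": [2, 3, 6, 8],
    },
    index=[1, 2, 5, 7],
)

# One row per list element; the empty list becomes a single NaN row.
exploded = df["char"].explode()

# Indicator columns per element, collapsed back to one row per original index.
# The NaN row from the empty list produces all zeros, i.e. all False.
flags = (
    exploded.str.get_dummies()
    .groupby(level=0)
    .sum()
    .astype(bool)
    .add_prefix("char ")  # illustrative: matches "char a", "char b", ...
)

# Drop the list column and attach the boolean columns by index.
result = df.drop(columns="char").join(flags)
print(result)
```

Grouping on the index level is what merges the exploded rows back together, so the index does not need to be contiguous, only unique.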

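Alternatively, use scikit-learn's MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix with one column per unique list value
s = pd.DataFrame(mlb.fit_transform(df['char']), columns=mlb.classes_, index=df.index).astype(bool)
df = df.join(s)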
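
The scikit-learn variant can be sketched the same way on the question's sample data. As before, the `add_prefix("char ")` call and the names `encoded`, `flags`, and `result` are illustrative assumptions added to reproduce the desired column names, not part of the snippet above.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample frame mirroring the question's input (index labels 1, 2, 5, 7).
df = pd.DataFrame(
    {
        "char": [["a", "b", "c"], ["a", "c"], ["c", "a", "d"], []],
        "id": [2, 3, 6, 8],
    },
    index=[1, 2, 5, 7],
)

mlb = MultiLabelBinarizer()

# 0/1 matrix with one column per unique value; an empty list gives a zero row.
encoded = mlb.fit_transform(df["char"])

# Reuse the original index so the join lines up row by row.
flags = (
    pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
    .astype(bool)
    .add_prefix("char ")  # illustrative: matches "char a", "char b", ...
)

result = df.drop(columns="char").join(flags)
print(result)
```

Passing `index=df.index` matters here because the array returned by `fit_transform` carries no index of its own.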

Upvotes: 2
