One Hot Encoding Multiple Categorical Data in a Column

Question

Beginner here. I want to use one hot encoding on my data frame that has multiple categorical data in one column. My data frame looks something like this, although with more things in the column such that I can't just do it manually:

Title       column
Movie 1   Action, Fantasy
Movie 2   Fantasy, Drama
Movie 3   Action
Movie 4   Sci-Fi, Romance, Comedy
Movie 5   NA
etc.

My desired output:

 Title     Action  Fantasy  Drama  Sci-Fi  Romance  Comedy
Movie 1     1       1        0      0        0       0
Movie 2     0       1        1      0        0       0
Movie 3     1       0        0      0        0       0
Movie 4     0       0        0      1        1       1
Movie 5     0       0        0      0        0       0  
etc.

Thanks!

Daniel Labbe · Accepted Answer

Considering the input data as:

import pandas as pd
data = {'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'], 
        'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action', 'Sci-Fi, Romance, Comedy', np.nan]}
df = pd.DataFrame(data)
df
    Title   column
0   Movie 1 Action, Fantasy
1   Movie 2 Fantasy, Drama
2   Movie 3 Action
3   Movie 4 Sci-Fi, Romance, Comedy
4   Movie 5 NaN

This code produces the desired output:

# treat null values
df['column'].fillna('NA', inplace = True)

# separate all genres into one list, considering comma + space as separators
genre = df['column'].str.split(', ').tolist()

# flatten the list
flat_genre = [item for sublist in genre for item in sublist]

# convert to a set to make unique
set_genre = set(flat_genre)

# back to list
unique_genre = list(set_genre)

# remove NA
unique_genre.remove('NA')

# create columns by each unique genre
df = df.reindex(df.columns.tolist() + unique_genre, axis=1, fill_value=0)

# for each value inside column, update the dummy
for index, row in df.iterrows():
    for val in row.column.split(', '):
        if val != 'NA':
            df.loc[index, val] = 1

df.drop('column', axis = 1, inplace = True)    
df
    Title   Action  Fantasy Comedy  Sci-Fi  Drama   Romance
0   Movie 1 1       1       0       0       0       0
1   Movie 2 0       1       0       0       1       0
2   Movie 3 1       0       0       0       0       0
3   Movie 4 0       0       1       1       0       1
4   Movie 5 0       0       0       0       0       0

UPDATE: I've added a null value into the test data, and treat it appropriately in the first line of the solution.

One Hot Encoding Multiple Categorical Data in a Column

Answers (2)

Related Questions