razortight
razortight

Reputation: 43

One Hot Encoding Multiple Categorical Data in a Column

Beginner here. I want to use one hot encoding on my data frame that has multiple categorical data in one column. My data frame looks something like this, although with more things in the column such that I can't just do it manually:

Title       column
Movie 1   Action, Fantasy
Movie 2   Fantasy, Drama
Movie 3   Action
Movie 4   Sci-Fi, Romance, Comedy
Movie 5   NA
etc.

My desired output:

 Title     Action  Fantasy  Drama  Sci-Fi  Romance  Comedy
Movie 1     1       1        0      0        0       0
Movie 2     0       1        1      0        0       0
Movie 3     1       0        0      0        0       0
Movie 4     0       0        0      1        1       1
Movie 5     0       0        0      0        0       0  
etc.

Thanks!

Upvotes: 4

Views: 3157

Answers (2)

Daniel Labbe
Daniel Labbe

Reputation: 2019

Considering the input data as:

import pandas as pd
data = {'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'], 
        'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action', 'Sci-Fi, Romance, Comedy', np.nan]}
df = pd.DataFrame(data)
df
    Title   column
0   Movie 1 Action, Fantasy
1   Movie 2 Fantasy, Drama
2   Movie 3 Action
3   Movie 4 Sci-Fi, Romance, Comedy
4   Movie 5 NaN

This code produces the desired output:

# treat null values
df['column'].fillna('NA', inplace = True)

# separate all genres into one list, considering comma + space as separators
genre = df['column'].str.split(', ').tolist()

# flatten the list
flat_genre = [item for sublist in genre for item in sublist]

# convert to a set to make unique
set_genre = set(flat_genre)

# back to list
unique_genre = list(set_genre)

# remove NA
unique_genre.remove('NA')

# create columns by each unique genre
df = df.reindex(df.columns.tolist() + unique_genre, axis=1, fill_value=0)

# for each value inside column, update the dummy
for index, row in df.iterrows():
    for val in row.column.split(', '):
        if val != 'NA':
            df.loc[index, val] = 1

df.drop('column', axis = 1, inplace = True)    
df
    Title   Action  Fantasy Comedy  Sci-Fi  Drama   Romance
0   Movie 1 1       1       0       0       0       0
1   Movie 2 0       1       0       0       1       0
2   Movie 3 1       0       0       0       0       0
3   Movie 4 0       0       1       1       0       1
4   Movie 5 0       0       0       0       0       0

UPDATE: I've added a null value into the test data, and treat it appropriately in the first line of the solution.

Upvotes: 5

Maaz Basar
Maaz Basar

Reputation: 13

### Import libraries and load sample data

import numpy as np
import pandas as pd

data = {
    'Movie 1': ['Action, Fantasy'],
    'Movie 2': ['Fantasy, Drama'],
    'Movie 3': ['Action'],
    'Movie 4': ['Sci-Fi, Romance, Comedy'],
    'Movie 5': ['NA'],
}

df = pd.DataFrame.from_dict(data, orient='index')
df.rename(columns={0:'column'}, inplace=True)

At this stage our DataFrame looks like this:

           column
Movie 1    Action, Fantasy
Movie 2    Fantasy, Drama
Movie 3    Action
Movie 4    Sci-Fi, Romance, Comedy
Movie 5    NA

Now, the question we're asking is - does a given genre word ("sub-string") occur in 'column' for a given movie?

To do this we'll first need a list of genre words:

### Join every string in every row, split the result, pull out the unique values.
genres = np.unique(', '.join(df['column']).split(', '))
### Drop 'NA'
genres = np.delete(genres, np.where(genres == 'NA'))

Depending on how large your dataset is, this could be computationally costly. You mentioned that you know the unique values already. So you could just define the iterable 'genres' manually.

Getting the OneHotVectors:

for genre in genres:
    df[genre] = df['column'].str.contains(genre).astype('int')

df.drop('column', axis=1, inplace=True)

We loop through each genre, we ask whether the genre exists in 'column', this returns a True or False, which is converted to 1 or 0 respectively - when we cast to type('int').

We end up with:

          Action    Comedy  Drama   Fantasy Romance Sci-Fi
Movie 1        1         0      0         1       0      0
Movie 2        0         0      1         1       0      0
Movie 3        1         0      0         0       0      0
Movie 4        0         1      0         0       1      1
Movie 5        0         0      0         0       0      0

Upvotes: 1

Related Questions