Reputation: 43
Beginner here. I want to use one hot encoding on my data frame that has multiple categorical data in one column. My data frame looks something like this, although with more things in the column such that I can't just do it manually:
Title column
Movie 1 Action, Fantasy
Movie 2 Fantasy, Drama
Movie 3 Action
Movie 4 Sci-Fi, Romance, Comedy
Movie 5 NA
etc.
My desired output:
Title Action Fantasy Drama Sci-Fi Romance Comedy
Movie 1 1 1 0 0 0 0
Movie 2 0 1 1 0 0 0
Movie 3 1 0 0 0 0 0
Movie 4 0 0 0 1 1 1
Movie 5 0 0 0 0 0 0
etc.
Thanks!
Upvotes: 4
Views: 3157
Reputation: 2019
Considering the input data as:
import pandas as pd
data = {'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'],
'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action', 'Sci-Fi, Romance, Comedy', np.nan]}
df = pd.DataFrame(data)
df
Title column
0 Movie 1 Action, Fantasy
1 Movie 2 Fantasy, Drama
2 Movie 3 Action
3 Movie 4 Sci-Fi, Romance, Comedy
4 Movie 5 NaN
This code produces the desired output:
# treat null values
df['column'].fillna('NA', inplace = True)
# separate all genres into one list, considering comma + space as separators
genre = df['column'].str.split(', ').tolist()
# flatten the list
flat_genre = [item for sublist in genre for item in sublist]
# convert to a set to make unique
set_genre = set(flat_genre)
# back to list
unique_genre = list(set_genre)
# remove NA
unique_genre.remove('NA')
# create columns by each unique genre
df = df.reindex(df.columns.tolist() + unique_genre, axis=1, fill_value=0)
# for each value inside column, update the dummy
for index, row in df.iterrows():
for val in row.column.split(', '):
if val != 'NA':
df.loc[index, val] = 1
df.drop('column', axis = 1, inplace = True)
df
Title Action Fantasy Comedy Sci-Fi Drama Romance
0 Movie 1 1 1 0 0 0 0
1 Movie 2 0 1 0 0 1 0
2 Movie 3 1 0 0 0 0 0
3 Movie 4 0 0 1 1 0 1
4 Movie 5 0 0 0 0 0 0
UPDATE: I've added a null value into the test data, and treat it appropriately in the first line of the solution.
Upvotes: 5
Reputation: 13
### Import libraries and load sample data
import numpy as np
import pandas as pd
data = {
'Movie 1': ['Action, Fantasy'],
'Movie 2': ['Fantasy, Drama'],
'Movie 3': ['Action'],
'Movie 4': ['Sci-Fi, Romance, Comedy'],
'Movie 5': ['NA'],
}
df = pd.DataFrame.from_dict(data, orient='index')
df.rename(columns={0:'column'}, inplace=True)
At this stage our DataFrame looks like this:
column
Movie 1 Action, Fantasy
Movie 2 Fantasy, Drama
Movie 3 Action
Movie 4 Sci-Fi, Romance, Comedy
Movie 5 NA
Now, the question we're asking is - does a given genre word ("sub-string") occur in 'column' for a given movie?
To do this we'll first need a list of genre words:
### Join every string in every row, split the result, pull out the unique values.
genres = np.unique(', '.join(df['column']).split(', '))
### Drop 'NA'
genres = np.delete(genres, np.where(genres == 'NA'))
Depending on how large your dataset is, this could be computationally costly. You mentioned that you know the unique values already. So you could just define the iterable 'genres' manually.
Getting the OneHotVectors:
for genre in genres:
df[genre] = df['column'].str.contains(genre).astype('int')
df.drop('column', axis=1, inplace=True)
We loop through each genre, we ask whether the genre exists in 'column', this returns a True or False, which is converted to 1 or 0 respectively - when we cast to type('int').
We end up with:
Action Comedy Drama Fantasy Romance Sci-Fi
Movie 1 1 0 0 1 0 0
Movie 2 0 0 1 1 0 0
Movie 3 1 0 0 0 0 0
Movie 4 0 1 0 0 1 1
Movie 5 0 0 0 0 0 0
Upvotes: 1