Reputation: 5
I am trying to make different columns from separated strings. My datasource is the https://grouplens.org/datasets/movielens/ ml-latest-small.zip (size: 1 MB)
movie_df = pd.read_csv('movies.csv')
movie_df.head(10)
Reading in the file, I have raw dataframe
I tried to do
movies_df = pd.read_csv('movies.csv', sep='|', encoding='latin-1',
names=['movie_id', 'movie_title','unknown', 'action','adventure', 'animation', 'childrens', 'comedy', 'crime', 'documentary', 'drama', 'fantasy','film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war', 'western'])
movies_df.head(10)
but this squishes everything before the separator to the first column and the first split on my genre also goes to the first column. Otherwise, it is what I need. See here.
How do I get all my genres of varying lengths to become a unique column after movieId and title? I want each genre to be a column with NaNs if it is not that column to set up for creating dummy variables later.
Edit: I did movies_df.head(10).to_dict()
and the output was:
'title': {0: 'Toy Story (1995)',
1: 'Jumanji (1995)',
2: 'Grumpier Old Men (1995)',
3: 'Waiting to Exhale (1995)',
4: 'Father of the Bride Part II (1995)',
5: 'Heat (1995)',
6: 'Sabrina (1995)',
7: 'Tom and Huck (1995)',
8: 'Sudden Death (1995)',
9: 'GoldenEye (1995)'},
'genres': {0: 'Adventure|Animation|Children|Comedy|Fantasy',
1: 'Adventure|Children|Fantasy',
2: 'Comedy|Romance',
3: 'Comedy|Drama|Romance',
4: 'Comedy',
5: 'Action|Crime|Thriller',
6: 'Comedy|Romance',
7: 'Adventure|Children',
8: 'Action',
9: 'Action|Adventure|Thriller'}}
Upvotes: 0
Views: 118
Reputation: 4772
The following seems to work. That said, ideally it should be changed to avoid iterating through the genres column in order to get the list of genres, since looping through columns is slow.
movie_df = pd.read_csv('movies.csv')
genre_set = set()
for lst in movie_df['genres'].str.split('|'):
genre_set.update(lst)
for g in genre_set:
movie_df[g] = np.nan
movie_df.loc[movie_df['genres'].str.contains(g),g] = g
The first for loop can be replaced with the single line genre_set.update(*movie_df['genres'].str.split('|'))
; I don't believe this changes performance.
The resulting frame movie_df
looks like this:
title \
0 Toy Story (1995)
1 Jumanji (1995)
2 Grumpier Old Men (1995)
3 Waiting to Exhale (1995)
4 Father of the Bride Part II (1995)
5 Heat (1995)
6 Sabrina (1995)
7 Tom and Huck (1995)
8 Sudden Death (1995)
9 GoldenEye (1995)
genres Action Romance Thriller \
0 Adventure|Animation|Children|Comedy|Fantasy NaN NaN NaN
1 Adventure|Children|Fantasy NaN NaN NaN
2 Comedy|Romance NaN Romance NaN
3 Comedy|Drama|Romance NaN Romance NaN
4 Comedy NaN NaN NaN
5 Action|Crime|Thriller Action NaN Thriller
6 Comedy|Romance NaN Romance NaN
7 Adventure|Children NaN NaN NaN
8 Action Action NaN NaN
9 Action|Adventure|Thriller Action NaN Thriller
Adventure Crime Children Comedy Drama Animation Fantasy
0 Adventure NaN Children Comedy NaN Animation Fantasy
1 Adventure NaN Children NaN NaN NaN Fantasy
2 NaN NaN NaN Comedy NaN NaN NaN
3 NaN NaN NaN Comedy Drama NaN NaN
4 NaN NaN NaN Comedy NaN NaN NaN
5 NaN Crime NaN NaN NaN NaN NaN
6 NaN NaN NaN Comedy NaN NaN NaN
7 Adventure NaN Children NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN
9 Adventure NaN NaN NaN NaN NaN NaN
Upvotes: 1