Reputation: 115
I am trying to parse a movie database with Python 3. How can I parse the genres of each movie into separate variables? For example:
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
The first value is movie_id, the second is movie_name, and the third holds the genres, but I want to parse the genres as separate variables that belong to the corresponding movie. In other words, I want to use "|" as a second separator for my database. How can I achieve this? Here is my code:
import numpy as np
import pandas as pd
header = ["movie_id", "title", "genres"]
movie_db = pd.read_csv("movielens/movies.csv", sep=",", names=header)
Upvotes: 1
Views: 751
Reputation: 862641
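You can use a regular-expression separator that matches both , and |, but the first row must then contain at least as many genres as any other row, because the number of columns is taken from it: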
df = pd.read_csv("movielens/movies.csv", sep="[,|]", header=None, engine='python')
print(df)
   0                 1          2          3         4       5        6
0  1  Toy Story (1995)  Adventure  Animation  Children  Comedy  Fantasy
1  2    Jumanji (1995)  Adventure   Children   Fantasy    None     None
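To illustrate the caveat about the first row, here is a minimal sketch that feeds the same two sample rows in reverse order, so the shorter row comes first. The inline data and the `failed` flag are assumptions made for the demonstration, not part of the original file.

```python
import io
import pandas as pd

# Hypothetical sample: same two movies, but the shorter row is first.
data = (
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
)

try:
    pd.read_csv(io.StringIO(data), sep="[,|]", header=None, engine="python")
    failed = False
except pd.errors.ParserError as exc:
    # The table was sized from the first row (5 fields), so the
    # second row (7 fields) is rejected as malformed.
    failed = True
    print(exc)
```

If the row order in your file is not under your control, this is one reason to prefer reading with sep="," only and splitting the genres afterwards.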
However, it is better here to create a new column for each genre, set to 1 if the movie has that genre, with get_dummies, and then add these columns to the original ones with join:
movie_db = pd.read_csv("movielens/movies.csv", sep=",", names=header)
df = movie_db.join(movie_db.pop('genres').str.get_dummies())
print(df)
   movie_id             title  Adventure  Animation  Children  Comedy  Fantasy
0         1  Toy Story (1995)          1          1         1       1        1
1         2    Jumanji (1995)          1          0         1       0        1
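As a self-contained sketch of what the indicator columns make possible, the snippet below rebuilds the same frame from inline sample data (an assumption, standing in for movielens/movies.csv) and then filters and counts by genre. The names `comedies` and `genre_counts` are illustrative only.

```python
import io
import pandas as pd

# Inline sample standing in for movielens/movies.csv (assumption).
data = (
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)
header = ["movie_id", "title", "genres"]
movie_db = pd.read_csv(io.StringIO(data), sep=",", names=header)

# "|" is already the default separator of str.get_dummies;
# it is spelled out here only for clarity.
df = movie_db.join(movie_db.pop("genres").str.get_dummies(sep="|"))

# Titles of all movies tagged as Comedy.
comedies = df.loc[df["Comedy"] == 1, "title"].tolist()

# Number of movies per genre.
genre_counts = df.drop(columns=["movie_id", "title"]).sum()
```

With one 0/1 column per genre, filtering and counting become plain column operations, which is harder to do when genres sit in positional columns.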
But if you need the genres themselves as separate columns, you can split by |:
df = movie_db.join(movie_db.pop('genres').str.split('|', expand=True))
print(df)
   movie_id             title          0          1         2       3        4
0         1  Toy Story (1995)  Adventure  Animation  Children  Comedy  Fantasy
1         2    Jumanji (1995)  Adventure   Children   Fantasy    None     None
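The split columns are labelled 0, 1, 2, and so on, which can be awkward to work with. A possible variation, sketched here on inline sample data (an assumption), renames them before joining; the genre_1, genre_2, ... names are hypothetical and can be anything you like.

```python
import io
import pandas as pd

# Inline sample standing in for movielens/movies.csv (assumption).
data = (
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)
header = ["movie_id", "title", "genres"]
movie_db = pd.read_csv(io.StringIO(data), sep=",", names=header)

# Split into positional columns; shorter rows are padded with missing values.
genres = movie_db.pop("genres").str.split("|", expand=True)

# Replace the integer labels 0, 1, 2, ... with genre_1, genre_2, ...
genres.columns = [f"genre_{i + 1}" for i in genres.columns]

df = movie_db.join(genres)
```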
Upvotes: 2