Reputation: 25
I'm fairly new to python and working with a DataFrame in pandas & numpy from The Movie Database. One of the columns notes the main cast of each movie separated by the pipe symbol (|). I'm trying to find a way to split each individual cast member and list it in its own row with the movie title. I've attached a snippet below of the results I get.
tmdb_data = pd.read_csv('tmdb-movies.csv')
cast_split = tmdb_data[['original_title', 'cast']]
df = pd.DataFrame(cast_split)
df.head()
Expected Output:
original_title cast
0 Jursassic World Chris Patt
1 Jursassic World Bryce Dallas Howard
2 Jursassic World Irrfan Khan
Upvotes: 2
Views: 127
Reputation: 863481
Use pop
+ split
+ stack
+ rename
+ reset_index
for new Series
and then join
to original:
tmdb_data = pd.DataFrame({'movie':['Jursassic World', 'Insurgent'],
'cast':['Chris Patt|Bryce Dallas Howard|Irrfan Khan',
'Shailene Woodley|Theo James']},
columns=['movie', 'cast'])
print (tmdb_data)
movie cast
0 Jursassic World Chris Patt|Bryce Dallas Howard|Irrfan Khan
1 Insurgent Shailene Woodley|Theo James
df1 = (tmdb_data.join(tmdb_data.pop('cast').str.split('|', expand=True)
.stack()
.reset_index(level=1, drop=True)
.rename('cast'))
.reset_index(drop=True))
print (df1)
movie cast
0 Jursassic World Chris Patt
1 Jursassic World Bryce Dallas Howard
2 Jursassic World Irrfan Khan
3 Insurgent Shailene Woodley
4 Insurgent Theo James
Upvotes: 1
Reputation: 164793
First cast as a list (pardon the pun!), then rebuild dataframe via numpy
:
import pandas as pd
import numpy as np
df = pd.DataFrame([['Jursassic World', 'Chris Patt|Bryce Dallas Howard']], columns=['movie', 'cast'])
df.cast = df.cast.str.split('|')
df2 = pd.DataFrame({'movie': np.repeat(df.movie.values, df.cast.str.len()),
'cast': np.concatenate(df.cast.values)})
# cast movie
# 0 Chris Patt Jursassic World
# 1 Bryce Dallas Howard Jursassic World
Upvotes: 1