fredcoremusic
fredcoremusic

Reputation: 25

Split Names in column

I'm fairly new to python and working with a DataFrame in pandas & numpy from The Movie Database. One of the columns notes the main cast of each movie separated by the pipe symbol (|). I'm trying to find a way to split each individual cast member and list it in its own row with the movie title. I've attached a snippet below of the results I get.

tmdb_data = pd.read_csv('tmdb-movies.csv')
cast_split = tmdb_data[['original_title', 'cast']]
df = pd.DataFrame(cast_split)
df.head()

Movie Title & Cast

Expected Output:

    original_title                 cast
0  Jursassic World           Chris Patt
1  Jursassic World  Bryce Dallas Howard
2  Jursassic World          Irrfan Khan

Upvotes: 2

Views: 127

Answers (2)

jezrael
jezrael

Reputation: 863481

Use pop + split + stack + rename + reset_index for new Series and then join to original:


tmdb_data = pd.DataFrame({'movie':['Jursassic World', 'Insurgent'], 
                           'cast':['Chris Patt|Bryce Dallas Howard|Irrfan Khan', 
                                   'Shailene Woodley|Theo James']}, 
                         columns=['movie', 'cast'])


print (tmdb_data)
             movie                                        cast
0  Jursassic World  Chris Patt|Bryce Dallas Howard|Irrfan Khan
1        Insurgent                 Shailene Woodley|Theo James

df1 = (tmdb_data.join(tmdb_data.pop('cast').str.split('|', expand=True)
                               .stack()
                               .reset_index(level=1, drop=True)
                               .rename('cast'))
                               .reset_index(drop=True))

print (df1)

             movie                 cast
0  Jursassic World           Chris Patt
1  Jursassic World  Bryce Dallas Howard
2  Jursassic World          Irrfan Khan
3        Insurgent     Shailene Woodley
4        Insurgent           Theo James

Upvotes: 1

jpp
jpp

Reputation: 164793

First cast as a list (pardon the pun!), then rebuild dataframe via numpy:

import pandas as pd
import numpy as np

df = pd.DataFrame([['Jursassic World', 'Chris Patt|Bryce Dallas Howard']], columns=['movie', 'cast'])

df.cast = df.cast.str.split('|')

df2 = pd.DataFrame({'movie': np.repeat(df.movie.values, df.cast.str.len()),
                    'cast': np.concatenate(df.cast.values)})

#                   cast            movie
# 0           Chris Patt  Jursassic World
# 1  Bryce Dallas Howard  Jursassic World

Upvotes: 1

Related Questions