Split Names in column

Question

I'm fairly new to python and working with a DataFrame in pandas & numpy from The Movie Database. One of the columns notes the main cast of each movie separated by the pipe symbol (|). I'm trying to find a way to split each individual cast member and list it in its own row with the movie title. I've attached a snippet below of the results I get.

tmdb_data = pd.read_csv('tmdb-movies.csv')
cast_split = tmdb_data[['original_title', 'cast']]
df = pd.DataFrame(cast_split)
df.head()

Movie Title & Cast

Expected Output:

    original_title                 cast
0  Jursassic World           Chris Patt
1  Jursassic World  Bryce Dallas Howard
2  Jursassic World          Irrfan Khan

jezrael · Accepted Answer

Use pop + split + stack + rename + reset_index for new Series and then join to original:

tmdb_data = pd.DataFrame({'movie':['Jursassic World', 'Insurgent'], 
                           'cast':['Chris Patt|Bryce Dallas Howard|Irrfan Khan', 
                                   'Shailene Woodley|Theo James']}, 
                         columns=['movie', 'cast'])


print (tmdb_data)
             movie                                        cast
0  Jursassic World  Chris Patt|Bryce Dallas Howard|Irrfan Khan
1        Insurgent                 Shailene Woodley|Theo James

df1 = (tmdb_data.join(tmdb_data.pop('cast').str.split('|', expand=True)
                               .stack()
                               .reset_index(level=1, drop=True)
                               .rename('cast'))
                               .reset_index(drop=True))

print (df1)

             movie                 cast
0  Jursassic World           Chris Patt
1  Jursassic World  Bryce Dallas Howard
2  Jursassic World          Irrfan Khan
3        Insurgent     Shailene Woodley
4        Insurgent           Theo James

Split Names in column

Answers (2)

Related Questions