Flattening a column in dataframe that has nested dictionaries

Question

I know this has been done before but I am unable to accomplish it - I have read the data from kaffle - https://www.kaggle.com/rounakbanik/the-movies-dataset csv sample

adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
False,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",,8844,tt0113497,en,Jumanji,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
False,"{'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdrop_path': '/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg'}",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",,15602,tt0113228,en,Grumpier Old Men,"A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.",11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for Love.,Grumpier Old Men,False,6.5,92
False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive ""good man"" to break a string of less-than-stellar lovers. Friends and confidants Vannah, Bernie, Glo and Robin talk it all out, determined to find a better way to breathe.",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,"[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself... and never let you forget it.,Waiting to Exhale,False,6.1,34

  release_date                                             genres    budget
0   1995-10-30  [{'id': 16, 'name': 'Animation'}, {'id': 35, '...  30000000
1   1995-12-15  [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...  65000000
2   1995-12-22  [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...         0

' I am trying to normalize the data vertically and have tried this https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html:

But my data is brought in from a csv and not in a json format. Wanting:

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
                                           ['info', 'governor']])
result
         name  population    state shortname info.governor
0        Dade       12345   Florida    FL    Rick Scott
1     Broward       40000   Florida    FL    Rick Scott
2  Palm Beach       60000   Florida    FL    Rick Scott
3      Summit        1234   Ohio       OH    John Kasich
4    Cuyahoga        1337   Ohio       OH    John Kasich

I want to break out the genre so it looks like this:

  release_date  genres.id          genres.name                     budget
0   1995-10-30  16                 Animation                      30000000
1   1995-10-30  35.                Genre                          30000000
2   1995-12-15  12                 Adventure                      65000000
3   1995-12-15  14                 Genre                          65000000

.
.
100 1995-12-22  10749              Romance                        0
101 1995-12-22  35                 Genre                          0

I have tried to use

for data in test:
    data_row = data['genres']
    time = data['release_date']
      
    for row in data_row:
        row['Time'] = time
        rows.append(row)

and

result = pd.json_normalize(test.to_dict(), 'genres', ['budget'])

But am unsuccessful as my file is not a json and get errors AttributeError: 'str' object has no attribute 'values' also I am not sure if I have researched the right keywords

Ank · Accepted Answer

Another way to do it using pandas.DataFrame.apply and pandas.DataFrame.explode:

df = df.loc[:,['release_date','genres','budget']]
df['genres'] = df.genres.apply(eval)
df = df.explode('genres').dropna()
df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
df.drop('genres', axis=1, inplace=True)

Output:

>>> df
  release_date    budget  genres.id genres.name
0   1995-10-30  30000000         16   Animation
0   1995-10-30  30000000         35      Comedy
0   1995-10-30  30000000      10751      Family
1   1995-12-15  65000000         12   Adventure
1   1995-12-15  65000000         14     Fantasy
1   1995-12-15  65000000      10751      Family
2   1995-12-22         0      10749     Romance
2   1995-12-22         0         35      Comedy
3   1995-12-22  16000000         35      Comedy
3   1995-12-22  16000000         18       Drama
3   1995-12-22  16000000      10749     Romance

Steps:

Since the list of dictionaries in genres column are actually string values, apply eval to convert them to back to list of dictionaries.

Use pd.explode method to transform dictionaries in each list into separate row values (so only one dictionary in each row):

>>> df['genres'] = df.genres.apply(eval)
>>> df = df.explode('genres').dropna()
>>> df
  release_date                            genres    budget
0   1995-10-30   {'id': 16, 'name': 'Animation'}  30000000
0   1995-10-30      {'id': 35, 'name': 'Comedy'}  30000000
0   1995-10-30   {'id': 10751, 'name': 'Family'}  30000000
1   1995-12-15   {'id': 12, 'name': 'Adventure'}  65000000
1   1995-12-15     {'id': 14, 'name': 'Fantasy'}  65000000
1   1995-12-15   {'id': 10751, 'name': 'Family'}  65000000
2   1995-12-22  {'id': 10749, 'name': 'Romance'}         0
2   1995-12-22      {'id': 35, 'name': 'Comedy'}         0
3   1995-12-22      {'id': 35, 'name': 'Comedy'}  16000000
3   1995-12-22       {'id': 18, 'name': 'Drama'}  16000000
3   1995-12-22  {'id': 10749, 'name': 'Romance'}  16000000

Convert the dictionary key-values into separate columns by applying pandas.Series on it. Drop the old genres column:

>>> df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
>>> df.drop('genres', axis=1, inplace=True)
>>> df
  release_date    budget  genres.id genres.name
0   1995-10-30  30000000         16   Animation
0   1995-10-30  30000000         35      Comedy
0   1995-10-30  30000000      10751      Family
1   1995-12-15  65000000         12   Adventure
1   1995-12-15  65000000         14     Fantasy
1   1995-12-15  65000000      10751      Family
2   1995-12-22         0      10749     Romance
2   1995-12-22         0         35      Comedy
3   1995-12-22  16000000         35      Comedy
3   1995-12-22  16000000         18       Drama
3   1995-12-22  16000000      10749     Romance

Flattening a column in dataframe that has nested dictionaries

Answers (2)

Steps:

Related Questions