enavuio
enavuio

Reputation: 1500

Flattening a column in dataframe that has nested dictionaries

I know this has been done before but I am unable to accomplish it - I have read the data from kaffle - https://www.kaggle.com/rounakbanik/the-movies-dataset csv sample

adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
False,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",,8844,tt0113497,en,Jumanji,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
False,"{'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdrop_path': '/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg'}",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",,15602,tt0113228,en,Grumpier Old Men,"A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.",11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for Love.,Grumpier Old Men,False,6.5,92
False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive ""good man"" to break a string of less-than-stellar lovers. Friends and confidants Vannah, Bernie, Glo and Robin talk it all out, determined to find a better way to breathe.",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,"[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself... and never let you forget it.,Waiting to Exhale,False,6.1,34

  release_date                                             genres    budget
0   1995-10-30  [{'id': 16, 'name': 'Animation'}, {'id': 35, '...  30000000
1   1995-12-15  [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...  65000000
2   1995-12-22  [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...         0

' I am trying to normalize the data vertically and have tried this https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html:

But my data is brought in from a csv and not in a json format. Wanting:

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
                                           ['info', 'governor']])
result
         name  population    state shortname info.governor
0        Dade       12345   Florida    FL    Rick Scott
1     Broward       40000   Florida    FL    Rick Scott
2  Palm Beach       60000   Florida    FL    Rick Scott
3      Summit        1234   Ohio       OH    John Kasich
4    Cuyahoga        1337   Ohio       OH    John Kasich

I want to break out the genre so it looks like this:

  release_date  genres.id          genres.name                     budget
0   1995-10-30  16                 Animation                      30000000
1   1995-10-30  35.                Genre                          30000000
2   1995-12-15  12                 Adventure                      65000000
3   1995-12-15  14                 Genre                          65000000

.
.
100 1995-12-22  10749              Romance                        0
101 1995-12-22  35                 Genre                          0

I have tried to use

for data in test:
    data_row = data['genres']
    time = data['release_date']
      
    for row in data_row:
        row['Time'] = time
        rows.append(row)

and

result = pd.json_normalize(test.to_dict(), 'genres', ['budget'])

But am unsuccessful as my file is not a json and get errors AttributeError: 'str' object has no attribute 'values' also I am not sure if I have researched the right keywords

Upvotes: 1

Views: 588

Answers (2)

Ank
Ank

Reputation: 1714

Another way to do it using pandas.DataFrame.apply and pandas.DataFrame.explode:

df = df.loc[:,['release_date','genres','budget']]
df['genres'] = df.genres.apply(eval)
df = df.explode('genres').dropna()
df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
df.drop('genres', axis=1, inplace=True)

Output:

>>> df
  release_date    budget  genres.id genres.name
0   1995-10-30  30000000         16   Animation
0   1995-10-30  30000000         35      Comedy
0   1995-10-30  30000000      10751      Family
1   1995-12-15  65000000         12   Adventure
1   1995-12-15  65000000         14     Fantasy
1   1995-12-15  65000000      10751      Family
2   1995-12-22         0      10749     Romance
2   1995-12-22         0         35      Comedy
3   1995-12-22  16000000         35      Comedy
3   1995-12-22  16000000         18       Drama
3   1995-12-22  16000000      10749     Romance

Steps:

Since the list of dictionaries in genres column are actually string values, apply eval to convert them to back to list of dictionaries.

Use pd.explode method to transform dictionaries in each list into separate row values (so only one dictionary in each row):

>>> df['genres'] = df.genres.apply(eval)
>>> df = df.explode('genres').dropna()
>>> df
  release_date                            genres    budget
0   1995-10-30   {'id': 16, 'name': 'Animation'}  30000000
0   1995-10-30      {'id': 35, 'name': 'Comedy'}  30000000
0   1995-10-30   {'id': 10751, 'name': 'Family'}  30000000
1   1995-12-15   {'id': 12, 'name': 'Adventure'}  65000000
1   1995-12-15     {'id': 14, 'name': 'Fantasy'}  65000000
1   1995-12-15   {'id': 10751, 'name': 'Family'}  65000000
2   1995-12-22  {'id': 10749, 'name': 'Romance'}         0
2   1995-12-22      {'id': 35, 'name': 'Comedy'}         0
3   1995-12-22      {'id': 35, 'name': 'Comedy'}  16000000
3   1995-12-22       {'id': 18, 'name': 'Drama'}  16000000
3   1995-12-22  {'id': 10749, 'name': 'Romance'}  16000000

Convert the dictionary key-values into separate columns by applying pandas.Series on it. Drop the old genres column:

>>> df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
>>> df.drop('genres', axis=1, inplace=True)
>>> df
  release_date    budget  genres.id genres.name
0   1995-10-30  30000000         16   Animation
0   1995-10-30  30000000         35      Comedy
0   1995-10-30  30000000      10751      Family
1   1995-12-15  65000000         12   Adventure
1   1995-12-15  65000000         14     Fantasy
1   1995-12-15  65000000      10751      Family
2   1995-12-22         0      10749     Romance
2   1995-12-22         0         35      Comedy
3   1995-12-22  16000000         35      Comedy
3   1995-12-22  16000000         18       Drama
3   1995-12-22  16000000      10749     Romance

Upvotes: 1

Jonathan Leon
Jonathan Leon

Reputation: 5648

Here's one way using flatten_json and apply(lambda x:) . For some reason, json_normalize wouldn't work for me.

I downloaded the file, read it in and used the first 10 rows for this.

from flatten_json import flatten
df = pd.read_csv('movies_metadata.csv', low_memory=False)
dft = df[0:10]

def flattenjson(x):
    dfa = pd.DataFrame((flatten(d, '.') for d in eval(x['genres'])))
    dfa[['release_date', 'original_title', 'budget']] = x[['release_date', 'original_title', 'budget']]
    df_list.append(dfa)

df_list = []
dft.apply(lambda x: flattenjson(x), axis=1)
pd.concat(df_list)

      id       name release_date               original_title    budget
0     16  Animation   1995-10-30                    Toy Story  30000000
1     35     Comedy   1995-10-30                    Toy Story  30000000
2  10751     Family   1995-10-30                    Toy Story  30000000
0     12  Adventure   1995-12-15                      Jumanji  65000000
1     14    Fantasy   1995-12-15                      Jumanji  65000000
2  10751     Family   1995-12-15                      Jumanji  65000000
0  10749    Romance   1995-12-22             Grumpier Old Men         0
1     35     Comedy   1995-12-22             Grumpier Old Men         0
0     35     Comedy   1995-12-22            Waiting to Exhale  16000000
1     18      Drama   1995-12-22            Waiting to Exhale  16000000
2  10749    Romance   1995-12-22            Waiting to Exhale  16000000
0     35     Comedy   1995-02-10  Father of the Bride Part II         0
0     28     Action   1995-12-15                         Heat  60000000
1     80      Crime   1995-12-15                         Heat  60000000
2     18      Drama   1995-12-15                         Heat  60000000
3     53   Thriller   1995-12-15                         Heat  60000000
0     35     Comedy   1995-12-15                      Sabrina  58000000
1  10749    Romance   1995-12-15                      Sabrina  58000000
0     28     Action   1995-12-22                 Tom and Huck         0
1     12  Adventure   1995-12-22                 Tom and Huck         0
2     18      Drama   1995-12-22                 Tom and Huck         0
3  10751     Family   1995-12-22                 Tom and Huck         0
0     28     Action   1995-12-22                 Sudden Death  35000000
1     12  Adventure   1995-12-22                 Sudden Death  35000000
2     53   Thriller   1995-12-22                 Sudden Death  35000000
0     12  Adventure   1995-11-16                    GoldenEye  58000000
1     28     Action   1995-11-16                    GoldenEye  58000000
2     53   Thriller   1995-11-16                    GoldenEye  58000000

Upvotes: 2

Related Questions