Reputation: 1500
I know this has been done before but I am unable to accomplish it - I have read the data from kaffle - https://www.kaggle.com/rounakbanik/the-movies-dataset csv sample
adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
False,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",,8844,tt0113497,en,Jumanji,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
False,"{'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdrop_path': '/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg'}",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",,15602,tt0113228,en,Grumpier Old Men,"A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.",11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for Love.,Grumpier Old Men,False,6.5,92
False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive ""good man"" to break a string of less-than-stellar lovers. Friends and confidants Vannah, Bernie, Glo and Robin talk it all out, determined to find a better way to breathe.",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,"[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself... and never let you forget it.,Waiting to Exhale,False,6.1,34
release_date genres budget
0 1995-10-30 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... 30000000
1 1995-12-15 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... 65000000
2 1995-12-22 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... 0
' I am trying to normalize the data vertically and have tried this https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html:
But my data is brought in from a csv and not in a json format. Wanting:
data = [{'state': 'Florida',
'shortname': 'FL',
'info': {'governor': 'Rick Scott'},
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': {'governor': 'John Kasich'},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
['info', 'governor']])
result
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
I want to break out the genre so it looks like this:
release_date genres.id genres.name budget
0 1995-10-30 16 Animation 30000000
1 1995-10-30 35. Genre 30000000
2 1995-12-15 12 Adventure 65000000
3 1995-12-15 14 Genre 65000000
.
.
100 1995-12-22 10749 Romance 0
101 1995-12-22 35 Genre 0
I have tried to use
for data in test:
data_row = data['genres']
time = data['release_date']
for row in data_row:
row['Time'] = time
rows.append(row)
and
result = pd.json_normalize(test.to_dict(), 'genres', ['budget'])
But am unsuccessful as my file is not a json and get errors AttributeError: 'str' object has no attribute 'values' also I am not sure if I have researched the right keywords
Upvotes: 1
Views: 588
Reputation: 1714
Another way to do it using pandas.DataFrame.apply
and pandas.DataFrame.explode
:
df = df.loc[:,['release_date','genres','budget']]
df['genres'] = df.genres.apply(eval)
df = df.explode('genres').dropna()
df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
df.drop('genres', axis=1, inplace=True)
Output:
>>> df
release_date budget genres.id genres.name
0 1995-10-30 30000000 16 Animation
0 1995-10-30 30000000 35 Comedy
0 1995-10-30 30000000 10751 Family
1 1995-12-15 65000000 12 Adventure
1 1995-12-15 65000000 14 Fantasy
1 1995-12-15 65000000 10751 Family
2 1995-12-22 0 10749 Romance
2 1995-12-22 0 35 Comedy
3 1995-12-22 16000000 35 Comedy
3 1995-12-22 16000000 18 Drama
3 1995-12-22 16000000 10749 Romance
Since the list of dictionaries in genres column are actually string values, apply eval
to convert them to back to list of dictionaries.
Use pd.explode
method to transform dictionaries in each list into separate row values (so only one dictionary in each row):
>>> df['genres'] = df.genres.apply(eval)
>>> df = df.explode('genres').dropna()
>>> df
release_date genres budget
0 1995-10-30 {'id': 16, 'name': 'Animation'} 30000000
0 1995-10-30 {'id': 35, 'name': 'Comedy'} 30000000
0 1995-10-30 {'id': 10751, 'name': 'Family'} 30000000
1 1995-12-15 {'id': 12, 'name': 'Adventure'} 65000000
1 1995-12-15 {'id': 14, 'name': 'Fantasy'} 65000000
1 1995-12-15 {'id': 10751, 'name': 'Family'} 65000000
2 1995-12-22 {'id': 10749, 'name': 'Romance'} 0
2 1995-12-22 {'id': 35, 'name': 'Comedy'} 0
3 1995-12-22 {'id': 35, 'name': 'Comedy'} 16000000
3 1995-12-22 {'id': 18, 'name': 'Drama'} 16000000
3 1995-12-22 {'id': 10749, 'name': 'Romance'} 16000000
Convert the dictionary key-values into separate columns by applying pandas.Series
on it. Drop the old genres column:
>>> df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
>>> df.drop('genres', axis=1, inplace=True)
>>> df
release_date budget genres.id genres.name
0 1995-10-30 30000000 16 Animation
0 1995-10-30 30000000 35 Comedy
0 1995-10-30 30000000 10751 Family
1 1995-12-15 65000000 12 Adventure
1 1995-12-15 65000000 14 Fantasy
1 1995-12-15 65000000 10751 Family
2 1995-12-22 0 10749 Romance
2 1995-12-22 0 35 Comedy
3 1995-12-22 16000000 35 Comedy
3 1995-12-22 16000000 18 Drama
3 1995-12-22 16000000 10749 Romance
Upvotes: 1
Reputation: 5648
Here's one way using flatten_json
and apply(lambda x:)
. For some reason, json_normalize
wouldn't work for me.
I downloaded the file, read it in and used the first 10 rows for this.
from flatten_json import flatten
df = pd.read_csv('movies_metadata.csv', low_memory=False)
dft = df[0:10]
def flattenjson(x):
dfa = pd.DataFrame((flatten(d, '.') for d in eval(x['genres'])))
dfa[['release_date', 'original_title', 'budget']] = x[['release_date', 'original_title', 'budget']]
df_list.append(dfa)
df_list = []
dft.apply(lambda x: flattenjson(x), axis=1)
pd.concat(df_list)
id name release_date original_title budget
0 16 Animation 1995-10-30 Toy Story 30000000
1 35 Comedy 1995-10-30 Toy Story 30000000
2 10751 Family 1995-10-30 Toy Story 30000000
0 12 Adventure 1995-12-15 Jumanji 65000000
1 14 Fantasy 1995-12-15 Jumanji 65000000
2 10751 Family 1995-12-15 Jumanji 65000000
0 10749 Romance 1995-12-22 Grumpier Old Men 0
1 35 Comedy 1995-12-22 Grumpier Old Men 0
0 35 Comedy 1995-12-22 Waiting to Exhale 16000000
1 18 Drama 1995-12-22 Waiting to Exhale 16000000
2 10749 Romance 1995-12-22 Waiting to Exhale 16000000
0 35 Comedy 1995-02-10 Father of the Bride Part II 0
0 28 Action 1995-12-15 Heat 60000000
1 80 Crime 1995-12-15 Heat 60000000
2 18 Drama 1995-12-15 Heat 60000000
3 53 Thriller 1995-12-15 Heat 60000000
0 35 Comedy 1995-12-15 Sabrina 58000000
1 10749 Romance 1995-12-15 Sabrina 58000000
0 28 Action 1995-12-22 Tom and Huck 0
1 12 Adventure 1995-12-22 Tom and Huck 0
2 18 Drama 1995-12-22 Tom and Huck 0
3 10751 Family 1995-12-22 Tom and Huck 0
0 28 Action 1995-12-22 Sudden Death 35000000
1 12 Adventure 1995-12-22 Sudden Death 35000000
2 53 Thriller 1995-12-22 Sudden Death 35000000
0 12 Adventure 1995-11-16 GoldenEye 58000000
1 28 Action 1995-11-16 GoldenEye 58000000
2 53 Thriller 1995-11-16 GoldenEye 58000000
Upvotes: 2