Reputation: 147
I have the following df:
movie_id rating_all
0 tt7653254 [{'age': 'all', 'avg_rating': 8.1, 'count': 109326}, {'age': '<18', 'avg_rating': 8.8, 'count': 318}, {'age': '18-29', 'avg_rating': 8.3, 'count': 29740}, {'age': '30-44', 'avg_rating': 8.0, 'count': 33012}, {'age': '45+', 'avg_rating': 7.7, 'count': 7875}]
1 tt8579674 [{'age': 'all', 'avg_rating': 8.6, 'count': 9420}, {'age': '<18', 'avg_rating': 9.1, 'count': 35}, {'age': '18-29', 'avg_rating': 8.7, 'count': 2437}, {'age': '30-44', 'avg_rating': 8.5, 'count': 2529}, {'age': '45+', 'avg_rating': 8.3, 'count': 960}]
2 tt7286456 [{'age': 'all', 'avg_rating': 8.6, 'count': 592441}, {'age': '<18', 'avg_rating': 9.1, 'count': 2244}, {'age': '18-29', 'avg_rating': 8.7, 'count': 160506}, {'age': '30-44', 'avg_rating': 8.5, 'count': 160158}, {'age': '45+', 'avg_rating': 8.3, 'count': 30451}]
3 tt1302006 [{'age': 'all', 'avg_rating': 8.1, 'count': 187675}, {'age': '<18', 'avg_rating': 8.7, 'count': 461}, {'age': '18-29', 'avg_rating': 8.3, 'count': 41951}, {'age': '30-44', 'avg_rating': 7.9, 'count': 59729}, {'age': '45+', 'avg_rating': 7.8, 'count': 18550}]
4 tt7131622 [{'age': 'all', 'avg_rating': 7.8, 'count': 323152}, {'age': '<18', 'avg_rating': 8.4, 'count': 955}, {'age': '18-29', 'avg_rating': 7.9, 'count': 82133}, {'age': '30-44', 'avg_rating': 7.6, 'count': 95878}, {'age': '45+', 'avg_rating': 7.5, 'count': 26383}]
5 tt8637428 [{'age': 'all', 'avg_rating': 7.7, 'count': 21362}, {'age': '<18', 'avg_rating': 8.0, 'count': 45}, {'age': '18-29', 'avg_rating': 7.9, 'count': 5901}, {'age': '30-44', 'avg_rating': 7.6, 'count': 6492}, {'age': '45+', 'avg_rating': 7.3, 'count': 2133}]
And I want to transform it to something like:
movie_id all_avg all_count <18_avg <18_count 18-29_avg
0 tt7653254 8.1 109326 8.8 318 8.3
1 tt8579674 8.6 9420 9.1 35 8.7
2 tt7286456 8.6 592441 9.1 2244 8.7
3 tt1302006 8.1 187675 8.7 461 8.3
4 tt7131622 7.8 323152 8.4 955 7.9
5 tt8637428 7.7 21362 8 45 7.9
and so on...
I've tried

ratings.set_index('movie_id')['rating_all'].apply(pd.Series).reset_index()

and using

json_normalize(data,
               record_path=['rating_all'],
               meta=['movie_id']).set_index('movie_id')

but neither gives something similar. Is there an easy way to explode the dictionaries into columns?
Data to produce the DataFrame:
data = {'movie_id': ['tt7653254', 'tt8579674', 'tt7286456', 'tt1302006', 'tt7131622', 'tt8637428'],
'rating_all': [[{'age': 'all', 'avg_rating': 8.1, 'count': 109326},
{'age': '<18', 'avg_rating': 8.8, 'count': 318},
{'age': '18-29', 'avg_rating': 8.3, 'count': 29740},
{'age': '30-44', 'avg_rating': 8.0, 'count': 33012},
{'age': '45+', 'avg_rating': 7.7, 'count': 7875}],
[{'age': 'all', 'avg_rating': 8.6, 'count': 9420},
{'age': '<18', 'avg_rating': 9.1, 'count': 35},
{'age': '18-29', 'avg_rating': 8.7, 'count': 2437},
{'age': '30-44', 'avg_rating': 8.5, 'count': 2529},
{'age': '45+', 'avg_rating': 8.3, 'count': 960}],
[{'age': 'all', 'avg_rating': 8.6, 'count': 592441},
{'age': '<18', 'avg_rating': 9.1, 'count': 2244},
{'age': '18-29', 'avg_rating': 8.7, 'count': 160506},
{'age': '30-44', 'avg_rating': 8.5, 'count': 160158},
{'age': '45+', 'avg_rating': 8.3, 'count': 30451}],
[{'age': 'all', 'avg_rating': 8.1, 'count': 187675},
{'age': '<18', 'avg_rating': 8.7, 'count': 461},
{'age': '18-29', 'avg_rating': 8.3, 'count': 41951},
{'age': '30-44', 'avg_rating': 7.9, 'count': 59729},
{'age': '45+', 'avg_rating': 7.8, 'count': 18550}],
[{'age': 'all', 'avg_rating': 7.8, 'count': 323152},
{'age': '<18', 'avg_rating': 8.4, 'count': 955},
{'age': '18-29', 'avg_rating': 7.9, 'count': 82133},
{'age': '30-44', 'avg_rating': 7.6, 'count': 95878},
{'age': '45+', 'avg_rating': 7.5, 'count': 26383}],
[{'age': 'all', 'avg_rating': 7.7, 'count': 21362},
{'age': '<18', 'avg_rating': 8.0, 'count': 45},
{'age': '18-29', 'avg_rating': 7.9, 'count': 5901},
{'age': '30-44', 'avg_rating': 7.6, 'count': 6492},
{'age': '45+', 'avg_rating': 7.3, 'count': 2133}]]}
Upvotes: 4
Views: 8642
Reputation:
An alternative method is to set_index with "movie_id" and explode the rating_all column (this creates a Series with duplicate indexes). Then pass it to the DataFrame constructor to build a preliminary DataFrame, and use pivot on it to get the output in the desired format.
s = df.set_index('movie_id')['rating_all'].explode()
tmp = pd.DataFrame(s.tolist(), index=s.index).reset_index()
out = tmp.pivot(index='movie_id', columns='age', values=['avg_rating', 'count'])

(Note: pivot requires keyword arguments in pandas 2.0+; the positional form was deprecated.)
Output:
avg_rating count
age 18-29 30-44 45+ <18 all 18-29 30-44 45+ <18 all
movie_id
tt1302006 8.3 7.9 7.8 8.7 8.1 41951 59729 18550 461 187675
tt7131622 7.9 7.6 7.5 8.4 7.8 82133 95878 26383 955 323152
tt7286456 8.7 8.5 8.3 9.1 8.6 160506 160158 30451 2244 592441
tt7653254 8.3 8.0 7.7 8.8 8.1 29740 33012 7875 318 109326
tt8579674 8.7 8.5 8.3 9.1 8.6 2437 2529 960 35 9420
tt8637428 7.9 7.6 7.3 8.0 7.7 5901 6492 2133 45 21362
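The pivot above leaves a two-level column index, while the question asks for flat names like all_avg and <18_count. One way to get there is to join the two levels; this is a sketch (not part of the original answer) on a two-row subset of the question's data, with the age-first naming being my assumption about the desired format:

```python
import pandas as pd

# Two rows of the question's data, enough to demonstrate the reshaping.
df = pd.DataFrame({
    'movie_id': ['tt7653254', 'tt8579674'],
    'rating_all': [
        [{'age': 'all', 'avg_rating': 8.1, 'count': 109326},
         {'age': '<18', 'avg_rating': 8.8, 'count': 318}],
        [{'age': 'all', 'avg_rating': 8.6, 'count': 9420},
         {'age': '<18', 'avg_rating': 9.1, 'count': 35}],
    ],
})

# explode -> DataFrame -> pivot, as in the answer above.
s = df.set_index('movie_id')['rating_all'].explode()
tmp = pd.DataFrame(s.tolist(), index=s.index).reset_index()
out = tmp.pivot(index='movie_id', columns='age', values=['avg_rating', 'count'])

# Flatten ('avg_rating', 'all') -> 'all_avg' and ('count', '<18') -> '<18_count'.
out.columns = [f"{age}_{'avg' if metric == 'avg_rating' else metric}"
               for metric, age in out.columns]
flat = out.reset_index()
```

flat then has one row per movie_id with plain columns (movie_id, <18_avg, all_avg, <18_count, all_count), which can be reordered as needed.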
It turns out, set_index + explode + DataFrame + pivot is quite a bit faster than groupby + apply(DataFrame) + pivot_table:
%timeit -n 1000 s = df.set_index('movie_id')['rating_all'].explode(); tmp = pd.DataFrame(s.tolist(), index=s.index).reset_index(); out = tmp.pivot(index='movie_id', columns='age', values=['avg_rating', 'count'])
%timeit -n 1000 s = df.groupby('movie_id').rating_all.apply(lambda x: pd.DataFrame(x.values[0])).reset_index(); out = s.pivot_table(index='movie_id', columns=['age'], values=['avg_rating', 'count'], aggfunc='mean')
5.1 ms ± 465 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
21.6 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 1
Reputation: 2293
Assuming I've interpreted correctly what you're trying to do, you can achieve it as follows:
Starting from (what I assume!!) is your input dataset:
movie_id ratings
0 tt7653254 [{'age': 'all', 'avg_rating': 8.1, 'count': 10...
1 tt8579674 [{'age': 'all', 'avg_rating': 8.6, 'count': 94...
2 tt7286456 [{'age': 'all', 'avg_rating': 8.6, 'count': 59...
3 tt1302006 [{'age': 'all', 'avg_rating': 8.1, 'count': 18...
4 tt7131622 [{'age': 'all', 'avg_rating': 7.8, 'count': 32...
5 tt8637428 [{'age': 'all', 'avg_rating': 7.7, 'count': 21...
Where the ratings values are an actual list, not a string (if they're a string, x['ratings'] = x.ratings.apply(eval) turns them into Python objects).
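As an aside (not part of the original answer): when the lists arrive as strings, e.g. after a CSV round-trip, the standard library's ast.literal_eval is a safer drop-in for eval, since it only parses Python literals and cannot execute arbitrary code. A minimal sketch, using a hypothetical single-row frame:

```python
import ast

import pandas as pd

# Hypothetical frame where 'ratings' is the string repr of a list,
# as it would be after saving to and reloading from CSV.
x = pd.DataFrame({
    'movie_id': ['tt7653254'],
    'ratings': ["[{'age': 'all', 'avg_rating': 8.1, 'count': 109326}]"],
})

# Parse each string back into an actual list of dicts.
x['ratings'] = x['ratings'].apply(ast.literal_eval)
```

After this, x['ratings'] holds real lists and the groupby/apply step below works on it directly.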
First you want to explode each of the rows in the list to be a set of rows:
parsed = x.groupby('movie_id').ratings.apply(lambda x: pd.DataFrame(x.values[0])).reset_index()
Which will give you:
> parsed.head(10)
movie_id level_1 age avg_rating count
0 tt1302006 0 all 8.1 187675
1 tt1302006 1 <18 8.7 461
2 tt1302006 2 18-29 8.3 41951
3 tt1302006 3 30-44 7.9 59729
4 tt1302006 4 45+ 7.8 18550
5 tt7131622 0 all 7.8 323152
6 tt7131622 1 <18 8.4 955
7 tt7131622 2 18-29 7.9 82133
8 tt7131622 3 30-44 7.6 95878
9 tt7131622 4 45+ 7.5 26383
...
etc.
This is the key step. It's similar to the apply(pd.Series) step you've tried, but the difference is in the groupby. This allows us to return many rows per movie_id, instead of pandas trying to interpret what we return as a single row. It feels a little like a hack, but hey, it works!
Now you can pivot on count & rating to get the tabulated values:
tabulated = parsed.pivot_table(
    index='movie_id',
    columns=['age'],
    values=['avg_rating', 'count'],
    aggfunc='mean'
)
I've used mean here; sum would also work, assuming there's only one row of data per film and age group.
> tabulated
avg_rating count
age 18-29 30-44 45+ <18 all 18-29 30-44 45+ <18 all
movie_id
tt1302006 8.3 7.9 7.8 8.7 8.1 41951 59729 18550 461 187675
tt7131622 7.9 7.6 7.5 8.4 7.8 82133 95878 26383 955 323152
tt7286456 8.7 8.5 8.3 9.1 8.6 160506 160158 30451 2244 592441
tt7653254 8.3 8.0 7.7 8.8 8.1 29740 33012 7875 318 109326
tt8579674 8.7 8.5 8.3 9.1 8.6 2437 2529 960 35 9420
tt8637428 7.9 7.6 7.3 8.0 7.7 5901 6492 2133 45 21362
Upvotes: 5