lela_rib
lela_rib

Reputation: 147

Explode dict from Pandas column

I have the following df:

    movie_id    rating_all 
0   tt7653254   [{'age': 'all', 'avg_rating': 8.1, 'count': 109326}, {'age': '<18', 'avg_rating': 8.8, 'count': 318}, {'age': '18-29', 'avg_rating': 8.3, 'count': 29740}, {'age': '30-44', 'avg_rating': 8.0, 'count': 33012}, {'age': '45+', 'avg_rating': 7.7, 'count': 7875}]
1   tt8579674   [{'age': 'all', 'avg_rating': 8.6, 'count': 9420}, {'age': '<18', 'avg_rating': 9.1, 'count': 35}, {'age': '18-29', 'avg_rating': 8.7, 'count': 2437}, {'age': '30-44', 'avg_rating': 8.5, 'count': 2529}, {'age': '45+', 'avg_rating': 8.3, 'count': 960}]
2   tt7286456   [{'age': 'all', 'avg_rating': 8.6, 'count': 592441}, {'age': '<18', 'avg_rating': 9.1, 'count': 2244}, {'age': '18-29', 'avg_rating': 8.7, 'count': 160506}, {'age': '30-44', 'avg_rating': 8.5, 'count': 160158}, {'age': '45+', 'avg_rating': 8.3, 'count': 30451}]
3   tt1302006   [{'age': 'all', 'avg_rating': 8.1, 'count': 187675}, {'age': '<18', 'avg_rating': 8.7, 'count': 461}, {'age': '18-29', 'avg_rating': 8.3, 'count': 41951}, {'age': '30-44', 'avg_rating': 7.9, 'count': 59729}, {'age': '45+', 'avg_rating': 7.8, 'count': 18550}]
4   tt7131622   [{'age': 'all', 'avg_rating': 7.8, 'count': 323152}, {'age': '<18', 'avg_rating': 8.4, 'count': 955}, {'age': '18-29', 'avg_rating': 7.9, 'count': 82133}, {'age': '30-44', 'avg_rating': 7.6, 'count': 95878}, {'age': '45+', 'avg_rating': 7.5, 'count': 26383}]
5   tt8637428   [{'age': 'all', 'avg_rating': 7.7, 'count': 21362}, {'age': '<18', 'avg_rating': 8.0, 'count': 45}, {'age': '18-29', 'avg_rating': 7.9, 'count': 5901}, {'age': '30-44', 'avg_rating': 7.6, 'count': 6492}, {'age': '45+', 'avg_rating': 7.3, 'count': 2133}]

And I want to transform it to something like:

    movie_id    all_avg     all_count   <18_avg     <18_count   18-29_avg
0   tt7653254   8.1         109326      8.8         318         8.3
1   tt8579674   8.6         9420        9.1         35          8.7
2   tt7286456   8.6         592441      9.1         2244        8.7
3   tt1302006   8.1         187675      8.7         461         8.3
4   tt7131622   7.8         323152      8.4         955         7.9
5   tt8637428   7.7         21362       8           45          7.9

and so on...

I've tried

ratings.set_index('movie_id')['rating_all'].apply(pd.Series).reset_index()

and using

json_normalize(data, 
               record_path=['rating_all'], 
               meta=['movie_id']).set_index('movie_id')

but none give something similiar. Is there an easy way to explode the dictionary into columns?

Data to produce the DataFrame:

data = {'movie_id': ['tt7653254', 'tt8579674', 'tt7286456', 'tt1302006', 'tt7131622', 'tt8637428'], 
        'rating_all': [[{'age': 'all', 'avg_rating': 8.1, 'count': 109326},
                        {'age': '<18', 'avg_rating': 8.8, 'count': 318},   
                        {'age': '18-29', 'avg_rating': 8.3, 'count': 29740},
                        {'age': '30-44', 'avg_rating': 8.0, 'count': 33012},
                        {'age': '45+', 'avg_rating': 7.7, 'count': 7875}],
                       [{'age': 'all', 'avg_rating': 8.6, 'count': 9420},
                        {'age': '<18', 'avg_rating': 9.1, 'count': 35},
                        {'age': '18-29', 'avg_rating': 8.7, 'count': 2437},
                        {'age': '30-44', 'avg_rating': 8.5, 'count': 2529},
                        {'age': '45+', 'avg_rating': 8.3, 'count': 960}],
                       [{'age': 'all', 'avg_rating': 8.6, 'count': 592441},
                        {'age': '<18', 'avg_rating': 9.1, 'count': 2244},
                        {'age': '18-29', 'avg_rating': 8.7, 'count': 160506},
                        {'age': '30-44', 'avg_rating': 8.5, 'count': 160158},
                        {'age': '45+', 'avg_rating': 8.3, 'count': 30451}],
                       [{'age': 'all', 'avg_rating': 8.1, 'count': 187675},
                        {'age': '<18', 'avg_rating': 8.7, 'count': 461},
                        {'age': '18-29', 'avg_rating': 8.3, 'count': 41951},
                        {'age': '30-44', 'avg_rating': 7.9, 'count': 59729},
                        {'age': '45+', 'avg_rating': 7.8, 'count': 18550}],
                       [{'age': 'all', 'avg_rating': 7.8, 'count': 323152},
                        {'age': '<18', 'avg_rating': 8.4, 'count': 955},
                        {'age': '18-29', 'avg_rating': 7.9, 'count': 82133},
                        {'age': '30-44', 'avg_rating': 7.6, 'count': 95878},
                        {'age': '45+', 'avg_rating': 7.5, 'count': 26383}],
                       [{'age': 'all', 'avg_rating': 7.7, 'count': 21362},
                        {'age': '<18', 'avg_rating': 8.0, 'count': 45},
                        {'age': '18-29', 'avg_rating': 7.9, 'count': 5901},
                        {'age': '30-44', 'avg_rating': 7.6, 'count': 6492},
                        {'age': '45+', 'avg_rating': 7.3, 'count': 2133}]]}

Upvotes: 4

Views: 8642

Answers (2)

user7864386
user7864386

Reputation:

An alternative method is to set_index with "movie_id" and explode the rating_all column (this creates a Series with duplicate indexes). Then cast it to a DataFrame constructor to build the preliminary DataFrame. Then use pivot on it to get the output in the desired format.

s = df.set_index('movie_id')['rating_all'].explode()
tmp = pd.DataFrame(s.tolist(), index = s.index).reset_index()
out = tmp.pivot('movie_id', 'age', ['avg_rating','count'])

Output:

          avg_rating                        count                             
age            18-29 30-44  45+  <18  all   18-29   30-44    45+   <18         all
movie_id                                                                      
tt1302006        8.3   7.9  7.8  8.7  8.1   41951   59729  18550   461      187675
tt7131622        7.9   7.6  7.5  8.4  7.8   82133   95878  26383   955      323152
tt7286456        8.7   8.5  8.3  9.1  8.6  160506  160158  30451  2244      592441
tt7653254        8.3   8.0  7.7  8.8  8.1   29740   33012   7875   318      109326
tt8579674        8.7   8.5  8.3  9.1  8.6    2437    2529    960    35        9420
tt8637428        7.9   7.6  7.3  8.0  7.7    5901    6492   2133    45       21362

It turns out, set_index + explode + DataFrame + pivot is quite a bit faster than groupby + apply(DataFrame) + pivot_table:

%timeit -n 1000 s = df.set_index('movie_id')['rating_all'].explode(); tmp = pd.DataFrame(s.tolist(), index = s.index).reset_index(); out = tmp.pivot('movie_id', 'age', ['avg_rating','count'])
%timeit -n 1000 s = df.groupby('movie_id').rating_all.apply(lambda x: pd.DataFrame(x.values[0])).reset_index(); out = s.pivot_table(index='movie_id', columns=['age'],values=['avg_rating','count'],aggfunc='mean')
5.1 ms ± 465 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
21.6 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 1

greg_data
greg_data

Reputation: 2293

Assuming I've interpreted correctly what you're trying to do, you can achieve it as follows:

Starting from (what I assume!!) is your input dataset:

    movie_id                                            ratings
0  tt7653254  [{'age': 'all', 'avg_rating': 8.1, 'count': 10...
1  tt8579674  [{'age': 'all', 'avg_rating': 8.6, 'count': 94...
2  tt7286456  [{'age': 'all', 'avg_rating': 8.6, 'count': 59...
3  tt1302006  [{'age': 'all', 'avg_rating': 8.1, 'count': 18...
4  tt7131622  [{'age': 'all', 'avg_rating': 7.8, 'count': 32...
5  tt8637428  [{'age': 'all', 'avg_rating': 7.7, 'count': 21...

Where the ratings values are an actual list, not a string (if they're a string, x['ratings'] = x.ratings.apply(eval) to turn them into an object).

First you want to explode each of the rows in the list to be a set of rows:

parsed = x.groupby('movie_id').ratings.apply(lambda x: pd.DataFrame(x.values[0])).reset_index()

Which will give you:

> parsed.head(10)
    movie_id  level_1    age  avg_rating   count
0  tt1302006        0    all         8.1  187675
1  tt1302006        1    <18         8.7     461
2  tt1302006        2  18-29         8.3   41951
3  tt1302006        3  30-44         7.9   59729
4  tt1302006        4    45+         7.8   18550
5  tt7131622        0    all         7.8  323152
6  tt7131622        1    <18         8.4     955
7  tt7131622        2  18-29         7.9   82133
8  tt7131622        3  30-44         7.6   95878
9  tt7131622        4    45+         7.5   26383
...

etc.

This is the key step. It's similar to the apply(pd.Series) step you've tried, but the difference is in the groupby. This allows us to return many rows per movie_id, instead of pandas trying to interpret what we return as a single row. It feels a little like a hack, but hey, it works!

Now you can pivot on count & rating to get the tabulated values:

tabulated= parsed.pivot_table(
    index='movie_id',
    columns=['age'],
    values=['avg_rating','count'],
    aggfunc='mean'
)

I've used mean here, sum would also work assuming there's only one row of data per film and age group

> tabulated
          avg_rating                        count                             
age            18-29 30-44  45+  <18  all   18-29   30-44    45+   <18         all
movie_id                                                                      
tt1302006        8.3   7.9  7.8  8.7  8.1   41951   59729  18550   461      187675
tt7131622        7.9   7.6  7.5  8.4  7.8   82133   95878  26383   955      323152
tt7286456        8.7   8.5  8.3  9.1  8.6  160506  160158  30451  2244      592441
tt7653254        8.3   8.0  7.7  8.8  8.1   29740   33012   7875   318      109326
tt8579674        8.7   8.5  8.3  9.1  8.6    2437    2529    960    35        9420
tt8637428        7.9   7.6  7.3  8.0  7.7    5901    6492   2133    45       21362

Upvotes: 5

Related Questions