MarkS
MarkS

Reputation: 1539

Pandas groupby individual elements from a larger split string

I have a 25 million row df that has a column of movie genres separated by a '|' character:

   userId  movieId  rating   timestamp                title                       genres
0       1      296     5.0  1147880044  Pulp Fiction (1994)  Comedy|Crime|Drama|Thriller
1       3      296     5.0  1439474476  Pulp Fiction (1994)  Comedy|Crime|Drama|Thriller
2       4      296     4.0  1573938898  Pulp Fiction (1994)  Comedy|Crime|Drama|Thriller
3       5      296     4.0   830786155  Pulp Fiction (1994)  Comedy|Crime|Drama|Thriller
4       7      296     4.0   835444730  Pulp Fiction (1994)  Comedy|Crime|Drama|Thriller 

I want to get the mean rating by unique genre.

I can extract all of the unique genres with this:

genres = pd.unique(df2['genres'].str.split('|', expand=True).stack())

Which yields:

['Adventure' 'Animation' 'Children' 'Comedy' 'Fantasy' 'Romance' 'Drama'
 'Action' 'Crime' 'Thriller' 'Horror' 'Mystery' 'Sci-Fi' 'IMAX'
 'Documentary' 'War' 'Musical' 'Western' 'Film-Noir' '(no genres listed)']

I can isolate rows containing a specific genre with (setting genre equal to a valid genre):

result[result['genres'].str.contains(genre)]

I found something close: pandas: Group by splitting string value in all rows (a column) and aggregation function

but I can't seem to get the syntax to groupby each individual genre so I can get the mean rating per genre.

Upvotes: 1

Views: 238

Answers (1)

Erfan
Erfan

Reputation: 42916

Using DataFrame.assign, Series.str.split, DataFrame.explode and GroupBy.mean:

dfg = (
    df.assign(genres=df["genres"].str.split("|"))
    .explode("genres")
    .groupby("genres", as_index=False)["rating"].mean()
)
     genres  rating
0    Comedy     4.4
1     Crime     4.4
2     Drama     4.4
3  Thriller     4.4

Upvotes: 2

Related Questions