Reputation: 95
Column id and runtime are comma-separated. However, column genres is separated by Pipe(|).
df = pd.read_csv(path, sep=',')
results in the table below. However, I can't conduct any queries on column genres, for instance finding the most popular genre by year? Is it possible to separate pipe into separate rows?
df.head()
id runtime genres Year
0 135397 124 Action|Adventure|Science Fiction|Thriller 2000
1 76341 120 Action|Adventure|Science Fiction|Thriller 2002
2 262500 119 Adventure|Science Fiction|Thriller 2001
3 140607 136 Action|Adventure|Science Fiction|Fantasy 2000
4 168259 137 Action|Crime|Thriller 1999
Upvotes: 0
Views: 33
Reputation: 4929
You're better reading the file as is, then split the genres into new rows with pandas explode
:
df = df.assign(genres = df.genres.str.split('|')).explode('genres')
so that you can easily manipulate your data.
For example, to get the most frequent (i.e. mode) genres per year:
df.groupby('Year').genres.apply(lambda x: x.mode()).droplevel(1)
To identify the counts:
def get_all_max(grp):
counts = grp.value_counts()
return counts[counts==counts.max()]
df.groupby('Year').genres.apply(get_all_max)\
.rename_axis(index={None:'Genre'}).to_frame(name='Count')
Upvotes: 1