Ii Oo
Ii Oo

Reputation: 95

Read in a csv with a different separator based on the each column?

Column id and runtime are comma-separated. However, column genres is separated by Pipe(|). df = pd.read_csv(path, sep=',') results in the table below. However, I can't conduct any queries on column genres, for instance finding the most popular genre by year? Is it possible to separate pipe into separate rows?

df.head()
    id  runtime genres                                       Year
0   135397  124 Action|Adventure|Science Fiction|Thriller    2000
1   76341   120 Action|Adventure|Science Fiction|Thriller    2002
2   262500  119 Adventure|Science Fiction|Thriller           2001
3   140607  136 Action|Adventure|Science Fiction|Fantasy     2000
4   168259  137 Action|Crime|Thriller                        1999

Upvotes: 0

Views: 33

Answers (1)

You're better reading the file as is, then split the genres into new rows with pandas explode:

df = df.assign(genres = df.genres.str.split('|')).explode('genres')

so that you can easily manipulate your data.


For example, to get the most frequent (i.e. mode) genres per year:

df.groupby('Year').genres.apply(lambda x: x.mode()).droplevel(1)

To identify the counts:

def get_all_max(grp):
    counts = grp.value_counts()
    return counts[counts==counts.max()]

df.groupby('Year').genres.apply(get_all_max)\
.rename_axis(index={None:'Genre'}).to_frame(name='Count')

Upvotes: 1

Related Questions