Antrikshy

Reputation: 3106

Seaborn scatterplot size based on frequency of occurrence

I'm trying to plot data using the Seaborn library where:

x-axis - movie release year
y-axis - movie rating (0-10, discrete)

I'm using a scatterplot at the moment. My data is in a Pandas dataframe.

Because the ratings are discrete integers, many points share the same coordinates and stack on top of each other. How can I make the size of each dot scale with how often that year/rating combination appears in the dataset?

For instance, if the number of 6/10 ratings in 2008 is higher than any other rating/year combination, I want that dot size (or something else in the plot) to indicate this.

Is there a different plot I should use for something like this instead?

Upvotes: 0

Views: 2799

Answers (1)

tdy

Reputation: 41327

"Is there a different plot I should use for something like this instead?"

I suggest visualizing this as a heatmap of a rating-year crosstab:

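import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# count titles per (rating, year); reindex so years with no titles appear as all-zero columns
years = range(df['Release Year'].min(), df['Release Year'].max() + 1)
cross = pd.crosstab(df['IMDB Rating'], df['Release Year']).reindex(columns=years, fill_value=0)

fig, ax = plt.subplots(figsize=(30, 5))
sns.heatmap(cross, cbar_kws=dict(label='Count'), ax=ax)
ax.invert_yaxis()  # low ratings at the bottom, high ratings at the top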

heatmap output
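
To see what the crosstab step produces before plotting, here is a minimal sketch on a small made-up frame. The toy values are assumptions for illustration only and are not taken from the movies dataset; the column names simply mirror the ones used above. Years with no rows are added back by reindex as all-zero columns, so gaps stay visible along the heatmap's x-axis.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Release Year': [2006, 2006, 2008, 2008, 2008, 2009],
    'IMDB Rating':  [7, 7, 6, 6, 6, 8],
})

# Full run of years, including 2007, which has no rows in the toy data
years = range(toy['Release Year'].min(), toy['Release Year'].max() + 1)

# Rows are ratings, columns are years, cells are frequencies
cross = (
    pd.crosstab(toy['IMDB Rating'], toy['Release Year'])
      .reindex(columns=years, fill_value=0)
)
print(cross)
```

Each cell of cross is the number of rows with that rating in that year, which is what the heatmap colors encode.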

But if you still prefer a scatterplot bubble chart, set the size param via groupby.size:

# one row per (year, rating) pair, with its frequency in the 'Count' column
counts = df.groupby(['Release Year', 'IMDB Rating']).size().reset_index(name='Count')

fig, ax = plt.subplots(figsize=(30, 5))
sns.scatterplot(data=counts, x='Release Year', y='IMDB Rating', size='Count', ax=ax)
ax.grid(axis='y')
sns.despine(left=True, bottom=True)

scatter output
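
If the bubbles come out too similar in size to tell apart, scatterplot also accepts a sizes parameter that sets the smallest and largest marker size. Below is a minimal sketch on made-up data; the toy frame and the (20, 400) range are assumptions chosen for illustration, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Release Year': [2006, 2006, 2008, 2008, 2008, 2009],
    'IMDB Rating':  [7, 7, 6, 6, 6, 8],
})

# Frequency of each (year, rating) pair
counts = (
    toy.groupby(['Release Year', 'IMDB Rating'])
       .size()
       .reset_index(name='Count')
)

fig, ax = plt.subplots(figsize=(6, 3))
# size= picks the column that drives marker size; sizes= sets the (min, max) range
sns.scatterplot(data=counts, x='Release Year', y='IMDB Rating',
                size='Count', sizes=(20, 400), ax=ax)
ax.grid(axis='y')
```

Widening or narrowing the sizes tuple is a matter of taste; a wider range makes rare and common combinations easier to distinguish at the cost of more overlap between neighboring bubbles.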


Data for reference:

import pandas as pd

url = 'https://raw.githubusercontent.com/vega/vega/main/docs/data/movies.json'
df = pd.read_json(url)[['Title', 'Release Date', 'IMDB Rating']]

# round to whole-number ratings; nullable Int8 keeps missing ratings as <NA>
df['IMDB Rating'] = df['IMDB Rating'].round().astype('Int8')
df['Release Year'] = pd.to_datetime(df['Release Date']).dt.year
df = df.loc[df['Release Year'] <= 2010]  # keep releases through 2010

Upvotes: 4
