Reputation: 2000
I have a binary classification problem that I want to solve with a RandomForestClassifier. My target column is 'successful', which is either 0 or 1. I want to explore the data and see what it looks like, so I tried count plots by category. But they don't show what percentage of the total is 'successful' (i.e. successful == 1).
How can I change the following plot so that the subplots display the percentage of successful posts (successful == 1) out of all posts? For example, if in the category weekday the day 'Saturday' has 10 datapoints and 7 of them are successful ('successful' == 1), I want the bar for that day to reach 0.7.
Here is the actual plot (counts :-/):
And here is a part of my dataframe:
And here is the actual code used to generate the actual plot:
# Plot
sns.set(style="darkgrid")
x_vals = [['page_name', 'weekday'],['type', 'industry']]
subtitles = [['by Page', 'by Weekday'],['by Content Type', 'by Industry']]
fig, ax = plt.subplots(2,2, figsize=(15,10))
#jitter = [[False, 1], [0.5, 0.2]]
for j in range(len(ax)):
    for i in range(len(ax[j])):
        ax[j][i].tick_params(labelsize=15)
        ax[j][i].set_xlabel('label', fontsize=17, position=(.5,20))
        if (j == 0) :
            ax[j][i].tick_params(axis="x", rotation=50)
        ax[j][i].set_ylabel('label', fontsize=17)
        ax[j][i] = sns.countplot(x=x_vals[j][i], hue="successful", data=mainDf, ax=ax[j][i])
for j in range(len(ax)):
    for i in range(len(ax[j])):
        ax[j][i].set_xlabel('', fontsize=17)
        ax[j][i].set_ylabel('count', fontsize=17)
        ax[j][i].set_title(subtitles[j][i], fontsize=18)
fig.suptitle('Success Count by Category', position=(.5,1.05), fontsize=20)
fig.tight_layout()
fig.show()
PS: Please note that I am using Seaborn, so the solution should also use Seaborn if possible. Thanks!
Upvotes: 4
Views: 5816
Reputation: 3902
You can use barplot here. I wasn't 100% sure what you actually want to achieve, so I developed several solutions.
Frequency of successful (unsuccessful) per total successful (unsuccessful)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
mainDf['frequency'] = 0 # a dummy column to refer to; after count() it holds the group sizes
for col, ax in zip(['page_name', 'weekday', 'type', 'industry'], axes.flatten()):
    # rows per (category value, successful) pair
    counts = mainDf.groupby([col, 'successful']).count()
    # divide by the total number of rows with the same 'successful' value
    freq_per_group = counts.div(counts.groupby('successful').transform('sum')).reset_index()
    sns.barplot(x=col, y='frequency', hue='successful', data=freq_per_group, ax=ax)
Frequency of successful (unsuccessful) per group
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
mainDf['frequency'] = 0 # a dummy column to refer to; after count() it holds the group sizes
for col, ax in zip(['page_name', 'weekday', 'type', 'industry'], axes.flatten()):
    # rows per (category value, successful) pair
    counts = mainDf.groupby([col, 'successful']).count()
    # divide by the number of rows with the same category value
    freq_per_group = counts.div(counts.groupby(col).transform('sum')).reset_index()
    sns.barplot(x=col, y='frequency', hue='successful', data=freq_per_group, ax=ax)
which, based on the data you provided, gives
Frequency of successful (unsuccessful) per total
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
mainDf['frequency'] = 0 # a dummy column to refer to; after count() it holds the group sizes
total = len(mainDf)
for col, ax in zip(['page_name', 'weekday', 'type', 'industry'], axes.flatten()):
    # rows per (category value, successful) pair
    counts = mainDf.groupby([col, 'successful']).count()
    # divide by the number of rows in the whole dataframe
    freq_per_total = counts.div(total).reset_index()
    sns.barplot(x=col, y='frequency', hue='successful', data=freq_per_total, ax=ax)
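To see how the three normalisations above differ, here is a small sketch on made-up data. The `toy` dataframe is purely hypothetical (it only borrows the 'Saturday' numbers from the question), and it uses `pd.crosstab` with its `normalize` argument instead of the groupby/div approach above, which should give the same fractions in a more compact form.

```python
import pandas as pd

# Hypothetical toy data: Saturday has 10 posts (7 successful),
# Sunday has 5 posts (1 successful).
toy = pd.DataFrame({
    "weekday": ["Saturday"] * 10 + ["Sunday"] * 5,
    "successful": [1] * 7 + [0] * 3 + [1] * 1 + [0] * 4,
})

# Per total successful (unsuccessful): each 'successful' column sums to 1.
per_outcome = pd.crosstab(toy["weekday"], toy["successful"], normalize="columns")

# Per group: each weekday row sums to 1 (Saturday, successful -> 7 / 10).
per_group = pd.crosstab(toy["weekday"], toy["successful"], normalize="index")

# Per total: all cells together sum to 1 (Saturday, successful -> 7 / 15).
per_total = pd.crosstab(toy["weekday"], toy["successful"], normalize="all")

print(per_outcome, per_group, per_total, sep="\n\n")
```

The "per group" table is the one that matches the 0.7 example from the question; the other two answer different questions about the same counts.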
Upvotes: 3
Reputation: 744
Change the line ax[j][i] = sns.countplot(x=x_vals[j][i], hue="successful", data=mainDf, ax=ax[j][i])
to ax[j][i] = sns.barplot(x=x_vals[j][i], y='successful', data=mainDf, ax=ax[j][i], ci=None, estimator=lambda x: sum(x) / len(x) * 100)
Your code would be
sns.set(style="darkgrid")
x_vals = [['page_name', 'weekday'],['type', 'industry']]
subtitles = [['by Page', 'by Weekday'],['by Content Type', 'by Industry']]
fig, ax = plt.subplots(2,2, figsize=(15,10))
#jitter = [[False, 1], [0.5, 0.2]]
for j in range(len(ax)):
    for i in range(len(ax[j])):
        ax[j][i].tick_params(labelsize=15)
        ax[j][i].set_xlabel('label', fontsize=17, position=(.5,20))
        if (j == 0) :
            ax[j][i].tick_params(axis="x", rotation=50)
        ax[j][i].set_ylabel('label', fontsize=17)
        ax[j][i] = sns.barplot(x=x_vals[j][i], y='successful', data=mainDf, ax=ax[j][i], ci=None, estimator=lambda x: sum(x) / len(x) * 100)
for j in range(len(ax)):
    for i in range(len(ax[j])):
        ax[j][i].set_xlabel('', fontsize=17)
        ax[j][i].set_ylabel('percent', fontsize=17)
        ax[j][i].set_title(subtitles[j][i], fontsize=18)
fig.suptitle('Success Percentage by Category', position=(.5,1.05), fontsize=20)
fig.tight_layout()
fig.show()
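The reason this works is that 'successful' only holds 0 and 1, so the per-category mean is the share of successful rows. A minimal sketch to check the numbers without plotting, assuming a hypothetical `toy` dataframe and a helper named `percent_successful` (both are illustrations, not part of the original data or code):

```python
import pandas as pd

# Hypothetical toy data: Saturday has 10 posts (7 successful),
# Sunday has 5 posts (1 successful).
toy = pd.DataFrame({
    "weekday": ["Saturday"] * 10 + ["Sunday"] * 5,
    "successful": [1] * 7 + [0] * 3 + [1] * 1 + [0] * 4,
})

def percent_successful(values):
    """Same computation as the lambda passed as estimator above."""
    return sum(values) / len(values) * 100

# Mean of a 0/1 column per category = fraction of successful rows.
share = toy.groupby("weekday")["successful"].mean()

# The same quantity expressed in percent.
percent = share * 100

# Applying the estimator per category gives the bar heights barplot would draw.
via_estimator = toy.groupby("weekday")["successful"].apply(percent_successful)

print(share, percent, via_estimator, sep="\n\n")
```

If you want the bars on a 0 to 1 scale, as in the 'Saturday' example from the question, dropping the `* 100` from the estimator should be enough.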
Upvotes: 0