Preston Hall

Reputation: 29

How can I add tick marks in a boxplot based on a boolean value in a DataFrame?

How can I add points or tick marks to this boxplot based on the value of 'boolean_value'?

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(140, 1)*1000, columns=['int_value'])

df['boolean_value'] = np.random.random(df.shape)<0.5

sns.boxplot(x=df['int_value'])
plt.show()

Upvotes: 1

Views: 1514

Answers (1)

Trenton McKinney

Reputation: 62473

  • Pass boolean_value as x to draw a separate box for each category (True and False).
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(140, 1)*1000, columns=['int_value'])
df['boolean_value'] = np.random.random(df.shape)<0.5

# one box per boolean category
sns.boxplot(y=df['int_value'], x=df['boolean_value'])
plt.show()


  • Overlaying the individual data points, as you asked in the comment, is not something a boxplot does by itself. However, a swarmplot can be drawn on top of it to create that effect.
sns.boxplot(y=df['int_value'], x=df['boolean_value'])
sns.swarmplot(y=df['int_value'], x=df['boolean_value'], color='black')
plt.show()


  • If you want a plot of only the True values:
sns.boxplot(y=df['int_value'], x=df['boolean_value'][df['boolean_value']==True])
sns.swarmplot(y=df['int_value'], x=df['boolean_value'][df['boolean_value']==True], color='black')
plt.show()
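
An alternative sketch of the same True-only plot filters the DataFrame first and passes column names through data=, so x and y are guaranteed to have matching lengths. The seed and the name true_only are assumptions added for illustration; they are not part of the original answer.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Seeded sample data (the seed is an assumption, added for reproducibility)
rng = np.random.default_rng(0)
df = pd.DataFrame({'int_value': rng.random(140) * 1000})
df['boolean_value'] = rng.random(len(df)) < 0.5

# Keep only the rows where boolean_value is True
true_only = df[df['boolean_value']]

# Box and points are drawn from the same filtered rows
ax = sns.boxplot(data=true_only, x='boolean_value', y='int_value')
sns.swarmplot(data=true_only, x='boolean_value', y='int_value', color='black', ax=ax)
plt.show()
```

Filtering once up front also avoids repeating the boolean mask in every plotting call.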


  • If you want the entire distribution as a single boxplot, but only want the True data points shown:
sns.boxplot(y=df['int_value'])
sns.swarmplot(y=df['int_value'], x=df['boolean_value'][df['boolean_value']==True], color='black', label='only True')
plt.xticks([0], [''])
plt.xlabel('True/False Boxplot Distribution')
plt.legend()
plt.show()


Note:

  • For the sample data, it's difficult to visually discern the difference between the distribution of only the True data and the combined True/False distribution, because their summary statistics are very similar.
df.describe()

        int_value
count  140.000000
mean   524.828022
std    302.097860
min      1.566518
25%    240.890088
50%    567.986782
75%    778.906109
max    995.508649

df.groupby('boolean_value').describe()

              int_value                                                                                  
                  count        mean         std       min         25%         50%         75%         max
boolean_value                                                                                            
False              70.0  525.125956  291.117406  1.566518  247.411473  577.119686  770.783246  995.508649
True               70.0  524.530087  314.800514  8.077607  233.074629  550.306306  828.866101  993.770101

Upvotes: 2
