Tlaloc-ES

Reputation: 5282

How to plot boxplots for two groups of data

I am plotting two separate box plots with pandas using this code:

plt.figure()
df['mean_train_score_error'] = 1 - df['mean_train_score']
df.boxplot(column=['mean_train_score_error'], by='modelo',
           medianprops=medianprops,
           autorange=True, showfliers=False, patch_artist=True,
           vert=True, showmeans=True, meanline=True)
plt.ylabel('Error: 1-F1 Score')
plt.title('Training error')
plt.suptitle('')

df['mean_test_score_error'] = 1 - df['mean_test_score']
df.boxplot(column=['mean_test_score_error'], by='modelo',
           medianprops=medianprops,
           autorange=True, showfliers=False, patch_artist=True,
           vert=True, showmeans=True, meanline=True)
plt.ylabel('Error: 1-F1 Score')
plt.title('Validation error')
plt.suptitle('')

And I am getting the following two plots:

[Images: two box plots, training error and validation error, each grouped by modelo]

Is it possible to draw all six box plots on the same plot, using a different color for each group of three (the three training box plots in one color and the three validation box plots in another)?

Upvotes: 0

Views: 170

Answers (1)

Trenton McKinney

Reputation: 62543

  • The easiest way to do this is to transform the data from wide to long format and then plot with seaborn, using the hue parameter.
  • pandas.wide_to_long
    • There must be a unique id, hence the added id column.
    • The columns being transformed must share a stubname, which is why I moved error to the front of the column names.
      • The error column names will end up in one column and their values in a separate column.
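
As a small illustration of the stubname requirement, here is a minimal sketch on a toy frame. The frame and its column names (error_train, error_test) are made up for this example and are not from the question; only the pd.wide_to_long call mirrors the one used further down.

```python
import pandas as pd

# Hypothetical toy frame: two columns that share the stub "error",
# plus the unique "id" column that wide_to_long requires.
wide_df = pd.DataFrame({
    "id": [0, 1],
    "error_train": [0.10, 0.20],
    "error_test": [0.30, 0.40],
})

# i   -> the unique row identifier
# j   -> name of the new column that receives the suffixes ("train", "test")
# sep -> separator between stub and suffix
# suffix is a regex; r"\D+" matches non-numeric suffixes
long_df = pd.wide_to_long(
    wide_df, stubnames="error", i="id", j="score", sep="_", suffix=r"\D+"
).reset_index()

print(long_df)
```

Each original row appears once per suffix, so the two-row frame becomes four rows with the columns id, score and error.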

Imports and Test Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# setup data and dataframe
np.random.seed(365)
data = {'mod_lg': np.random.normal(0.3, .1, size=(30,)),
        'mod_rf': np.random.normal(0.05, .01, size=(30,)),
        'mod_bg': np.random.normal(0.02, 0.002, size=(30,)),
        'mean_train_score': np.random.normal(0.95, 0.3, size=(30,)),
        'mean_test_score': np.random.normal(0.86, 0.5, size=(30,))}

df = pd.DataFrame(data)
df['error_mean_test_score'] = 1 - df['mean_test_score']
df['error_mean_train_score'] = 1 - df['mean_train_score']
df["id"] = df.index

# raw string avoids an invalid-escape warning for \D in newer Python versions
df = pd.wide_to_long(df, stubnames='mod', i='id', j='mode', sep='_', suffix=r'\D+').reset_index()
df["id"] = df.index

# display dataframe: this is probably what your dataframe looks like to generate your current plots
   id mode  mean_train_score  error_mean_test_score  mean_test_score  error_mean_train_score       mod
0   0   lg          0.663855              -0.343961         1.343961                0.336145  0.316792
1   1   lg          0.990114               0.472847         0.527153                0.009886  0.352351
2   2   lg          1.179775               0.324748         0.675252               -0.179775  0.381738
3   3   lg          0.693155               0.519526         0.480474                0.306845  0.470385
4   4   lg          1.191048              -0.128033         1.128033               -0.191048  0.085305

Transform and plot

  • The error_score_name column contains the suffix from error_mean_test_score and error_mean_train_score.
  • The error_score_value column contains the values.

# convert df error columns to long format
dfl = pd.wide_to_long(df, stubnames='error', i='id', j='score', sep='_', suffix=r'\D+').reset_index(level=1)
dfl.rename(columns={'score': 'error_score_name', 'error': 'error_score_value'}, inplace=True)

# display dfl

   error_score_name  mean_train_score       mod  mean_test_score mode  error_score_value
id                                                                                      
0   mean_test_score          0.663855  0.316792         1.343961   lg          -0.343961
1   mean_test_score          0.990114  0.352351         0.527153   lg           0.472847
2   mean_test_score          1.179775  0.381738         0.675252   lg           0.324748
3   mean_test_score          0.693155  0.470385         0.480474   lg           0.519526
4   mean_test_score          1.191048  0.085305         1.128033   lg          -0.128033

# plot dfl
sns.boxplot(x='mode', y='error_score_value', data=dfl, hue='error_score_name')

[Image: grouped box plot of error_score_value by mode, colored by error_score_name]
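
If the real dataframe already has a modelo column next to mean_train_score and mean_test_score, as the question's code suggests, the same long format can probably be reached with DataFrame.melt instead. The sketch below is only an assumption about that layout: the random data is a stand-in, and the helper names errors_to_long and plot_errors and the split and error column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's frame: one row per fit, with a
# 'modelo' label and the two score columns named in the question.
rng = np.random.default_rng(365)
df = pd.DataFrame({
    "modelo": np.repeat(["lg", "rf", "bg"], 10),
    "mean_train_score": rng.uniform(0.8, 1.0, size=30),
    "mean_test_score": rng.uniform(0.6, 0.9, size=30),
})


def errors_to_long(frame):
    """Return one row per (original row, split) with error = 1 - score."""
    errors = pd.DataFrame({
        "modelo": frame["modelo"],
        "train": 1 - frame["mean_train_score"],
        "test": 1 - frame["mean_test_score"],
    })
    # melt stacks the 'train' and 'test' columns into a single 'error' column
    return errors.melt(id_vars="modelo", var_name="split", value_name="error")


def plot_errors(long_frame):
    """Draw one group of boxes per modelo, with one color per split."""
    import seaborn as sns  # imported here so the reshape works without seaborn

    ax = sns.boxplot(data=long_frame, x="modelo", y="error", hue="split")
    ax.set_ylabel("Error: 1-F1 Score")
    return ax


long_df = errors_to_long(df)
print(long_df.head())
```

Calling plot_errors(long_df) and then plt.show() from matplotlib.pyplot should put all six boxes on one axes, with training and validation errors distinguished by the hue color.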

Upvotes: 1
