JohanvH

Reputation: 25

Plotting the mean of multiple columns including standard deviation

I have a data set with 8 columns and 6 rows. Each row is a measured variable (6 in total), and the columns hold measurements under 2 different conditions, with 4 columns of repeated measurements per condition.

Using Seaborn, I would like to generate a bar chart displaying the mean and standard deviation of each group of 4 columns, grouped by index key (i.e. measured variable). The dataframe structure is as follows:

import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({
    'S1_1':np.random.randn(6),
    'S1_2':np.random.randn(6),
    'S1_3':np.random.randn(6),
    'S1_4':np.random.randn(6),
    'S2_1':np.random.randn(6),
    'S2_2':np.random.randn(6),
    'S2_3':np.random.randn(6),
    'S2_4':np.random.randn(6),
    },index= ['var1','var2','var3','var4','var5','var6'])

How do I tell Seaborn that I want only 2 bars per variable, one for the first 4 columns and one for the second 4, with each bar displaying the mean (and standard deviation or some other measure of dispersion) across its 4 columns?

I was thinking of using multi-indexing, adding a second column level to group the columns into 2 conditions:

df.columns = pd.MultiIndex.from_arrays([['Condition 1'] * 4 + ['Condition 2'] * 4,df.columns])

but I can't figure out what I should pass to Seaborn to generate the plot I want.

If anyone could point me in the right direction, that would be a great help!

Upvotes: 1

Views: 3479

Answers (1)

Trenton McKinney

Reputation: 62523

Update Based on Comment

  • Plotting is all about reshaping the dataframe for the plot API
import pandas as pd
import seaborn as sns

# still create the groups
l = df.columns
n = 4
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)

# stack each group and add an id column
data_list = list()
for group in groups:
    # condition id: second character of the first column name, e.g. 'S2_1' -> '2'
    id_ = group[0][1]
    data = df[group].copy().T
    data['id_'] = id_
    data_list.append(data)
    
df2 = pd.concat(data_list, axis=0).reset_index()
df2.rename({'index': 'sample'}, axis=1, inplace=True)

# melt df2 into a long form
dfm = df2.melt(id_vars=['sample', 'id_'])

# plot (seaborn >= 0.12 deprecates ci='sd' in favor of errorbar='sd')
p = sns.catplot(kind='bar', data=dfm, x='variable', y='value', hue='id_', ci='sd', aspect=3)

df2.head()

  sample    YAL001C    YAL002W   YAL004W   YAL005C   YAL007C   YAL008W    YAL011W   YAL012W    YAL013W   YAL014C id_
0   S2_1 -13.062716  -8.084685  2.360795 -0.740357  3.086768 -0.117259  -5.678183  2.527573 -17.326287 -1.319402   2
1   S2_2  -5.431474 -12.676807  0.070569 -4.214761 -4.318011 -4.489010 -10.268632  0.691448 -24.189106 -2.343884   2
2   S2_3  -9.365509 -12.281169  0.497772 -3.228236  0.212941 -2.287206 -10.250004  1.111842 -27.811564 -4.329987   2
3   S2_4  -7.582111 -15.587219 -1.286167 -4.531494 -3.090265 -4.718281  -8.933496  2.079757 -21.580854 -2.834441   2
4   S3_1 -12.618254 -20.010779 -2.530541 -3.203072 -2.436503 -2.922565 -15.972632  3.551605 -35.618485 -4.925495   3

dfm.head()

  sample id_ variable      value
0   S2_1   2  YAL001C -13.062716
1   S2_2   2  YAL001C  -5.431474
2   S2_3   2  YAL001C  -9.365509
3   S2_4   2  YAL001C  -7.582111
4   S3_1   3  YAL001C -12.618254
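
For the small sample frame in the question, the same long-form reshape can probably be written more compactly with a single melt. The following is a minimal sketch, not the code that produced the output above. The names long_df, condition and summary are assumptions introduced here, and it assumes the column names follow a '<condition>_<replicate>' pattern such as S1_3.

```python
import numpy as np
import pandas as pd

# sample frame shaped like the one in the question (flat column names)
np.random.seed(10)
cols = [f'S{c}_{r}' for c in (1, 2) for r in range(1, 5)]
df = pd.DataFrame(
    {col: np.random.randn(6) for col in cols},
    index=[f'var{i}' for i in range(1, 7)],
)

# move the index into a column, then melt to one row per measurement
long_df = (
    df.rename_axis('variable')
      .reset_index()
      .melt(id_vars='variable', var_name='sample', value_name='value')
)

# 'S1_3' -> 'S1' (assumes the '<condition>_<replicate>' naming pattern)
long_df['condition'] = long_df['sample'].str.split('_').str[0]

# mean and standard deviation per variable and condition
summary = long_df.groupby(['variable', 'condition'])['value'].agg(['mean', 'std'])
print(summary)
```

long_df has the same role as dfm above, so it could be passed to sns.catplot with x='variable', y='value' and hue='condition'.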

Plot Result

[Image: bar chart of the mean value per variable, grouped by id_, with standard deviation error bars]

kind='box'

  • A box plot might be a better way to convey the distribution
p = sns.catplot(kind='box', data=dfm, y='variable', x='value', hue='id_', height=12)

[Image: box plot of value per variable, grouped by id_]


Original Answer

  • Use a list comprehension to chunk the columns into groups of 4
    • This uses the original, more comprehensive data that was posted. It can be found in revision 4
  • Create a figure with subplots and zip each group to an ax from axes
  • Use each group to select data from df and transpose the data with .T.
  • sns.barplot uses mean as its default estimator, so the length of each bar is the mean. Set ci='sd' so the error bar shows the standard deviation instead of a confidence interval.
    • sns.barplot(data=data, ci='sd', ax=ax) can easily be replaced with sns.boxplot(data=data, ax=ax)
import matplotlib.pyplot as plt
import seaborn as sns

# using the first comma separated data that was posted, create groups of 4
l = df.columns
n = 4  # chunk size for groups
groups = [l[i:i+n] for i in range(0, len(l), n)]
num_gps = len(groups)

# plot
fig, axes = plt.subplots(num_gps, 1, figsize=(12, 6*num_gps))

for ax, group in zip(axes, groups):
    data = df[group].T
    sns.barplot(data=data, ci='sd', ax=ax)
    ax.set_title(f'{group.to_list()}')
fig.tight_layout()
fig.savefig('test.png')

Example of data

  • The bar is the mean of each column, and the line is the standard deviation
       YAL001C    YAL002W   YAL004W   YAL005C   YAL007C   YAL008W    YAL011W   YAL012W    YAL013W   YAL014C
S8_1 -1.731388 -17.215712 -3.518643 -2.358103  0.418170 -1.529747 -12.630343  2.435674 -27.471971 -4.021264
S8_2 -1.325524 -24.056632 -0.984390 -2.119338 -1.770665 -1.447103 -10.618954  2.156420 -30.362998 -4.735058
S8_3 -2.024020 -29.094027 -6.146880 -2.101090 -0.732322 -2.773949 -12.642857 -0.009749 -28.486835 -4.783863
S8_4  2.541671 -13.599049 -2.688125 -2.329332 -0.694555 -2.820627  -8.498677  3.321018 -31.741916 -2.104281
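
To check what the bars and lines represent, the mean and standard deviation can be computed directly with pandas. This is a small sketch using three of the ten columns from the table above; the name summary is an assumption introduced here. pandas computes the sample standard deviation (ddof=1) by default, and the value seaborn draws may differ slightly depending on the seaborn version.

```python
import pandas as pd

# three of the ten columns from the example table above
data = pd.DataFrame(
    {
        'YAL001C': [-1.731388, -1.325524, -2.024020, 2.541671],
        'YAL002W': [-17.215712, -24.056632, -29.094027, -13.599049],
        'YAL004W': [-3.518643, -0.984390, -6.146880, -2.688125],
    },
    index=['S8_1', 'S8_2', 'S8_3', 'S8_4'],
)

# mean corresponds to the bar length, std to the spread shown by the line
summary = data.agg(['mean', 'std']).T
print(summary)
```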

Plot Result

[Image: bar charts, one subplot per group of 4 columns, with standard deviation error bars]

Upvotes: 1
