Reputation: 2619

Pandas histogram df.hist() group by

How can I plot histograms with pandas DataFrame.hist() split by group? I have a data frame with 5 columns: "A", "B", "C", "D" and "group".

There are two group classes: "yes" and "no".

Using:

df.hist()

I get a histogram for each of the 4 columns.

Now I would like to get the same 4 graphs, but with blue bars (group = "yes") and red bars (group = "no").

I tried this without success:

df.hist(by = "group")

Upvotes: 22

Answers (4)

Mohit Burkule

Reputation: 161

TL;DR one-liner.
It won't create subplots; it creates 4 separate plots:

[df.groupby('group')[i].plot(kind='hist',title=i).iloc[0] and plt.legend() and plt.show() for i in 'ABCD']

Full working example below

import numpy as np; np.random.seed(1)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)

# .plot() on the grouped column returns one axes per group; .iloc[0] picks the first
# so the "and" chain can go on to add the legend and show the figure.
[df.groupby('group')[i].plot(kind='hist',title=i).iloc[0] and plt.legend() and plt.show() for i in 'ABCD']
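If subplots are wanted instead of four separate figures, one possible variation is to iterate over the groupby object and draw every group onto the same axes. This is only a sketch, not part of the answer above; the 2x2 grid, the bin count and the alpha value are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1)
df = pd.DataFrame(np.random.randn(300, 4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32, 0.68], size=300)

# One axes per column, arranged in an (assumed) 2x2 grid.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, col in zip(axes.flat, "ABCD"):
    # Iterating a SeriesGroupBy yields (group name, values) pairs.
    for name, values in df.groupby("group")[col]:
        ax.hist(values, bins=20, alpha=0.5, label=name)
    ax.set_title(col)
    ax.legend(title="group")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure in an interactive session.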

Upvotes: 0

Shant Malkasian

Reputation: 1

I generalized the solution from one of the other answers. Hope it helps someone out there. I added a line to ensure that the binning (number and range) is the same for every group within a column. The code should work for both "binary" and "categorical" groupings, i.e. "by" can name a column with N unique groups. Plotting also stops once the number of columns to plot exceeds the available subplot space.

import numpy as np
import matplotlib.pyplot as plt

def composite_histplot(df, columns, by, nbins=25, alpha=0.5):
    def _sephist(df, col, by):
        # Split one column into {group value: values of col in that group}.
        unique_vals = df[by].unique()
        df_by = dict()
        for uv in unique_vals:
            df_by[uv] = df[df[by] == uv][col]
        return df_by
    # Fixed subplot grid: 4 rows x 5 columns.
    subplt_r = 4
    subplt_c = 5
    fig = plt.figure()
    for num, col in enumerate(columns):
        # Stop plotting once the grid is full.
        if num + 1 > subplt_r * subplt_c:
            break
        plt.subplot(subplt_r, subplt_c, num+1)
        # nbins + 1 edges give nbins bins, shared by all groups of this column.
        bins = np.linspace(df[col].min(), df[col].max(), nbins + 1)
        for lbl, sepcol in _sephist(df, col, by).items():
            plt.hist(sepcol, bins=bins, alpha=alpha, label=lbl)
        plt.legend(loc='upper right', title=by)
        plt.title(col)
    plt.tight_layout()

    return fig
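Why the bin edges are computed once per column can be checked numerically. The sketch below uses made-up data and np.histogram rather than the plotting function above (the column names "x" and "kind" are hypothetical). With shared edges the per-group counts add up, bin by bin, to the counts of the whole column, so the bars of different groups are directly comparable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: one numeric column and a three-level grouping column.
df = pd.DataFrame({
    "x": rng.normal(size=200),
    "kind": rng.choice(["a", "b", "c"], size=200),
})

nbins = 10
# nbins + 1 edges spanning the full column, shared by every group.
edges = np.linspace(df["x"].min(), df["x"].max(), nbins + 1)

total_counts, _ = np.histogram(df["x"], bins=edges)
group_counts = {
    name: np.histogram(values, bins=edges)[0]
    for name, values in df.groupby("kind")["x"]
}

# Because the edges are shared, the group counts sum to the overall counts.
summed = sum(group_counts.values())
print(np.array_equal(summed, total_counts))
```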

Upvotes: 0

Brad Solomon

Reputation: 40938

This is not the most flexible workaround but will work for your question specifically.

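import matplotlib.pyplot as plt

# df is the DataFrame from the question (columns 'A' to 'D' plus 'group').
def sephist(col):
    # Split one column into its 'yes' and 'no' values.
    yes = df[df['group'] == 'yes'][col]
    no = df[df['group'] == 'no'][col]
    return yes, no

# Subplot indices are 1-based, so start counting at 1.
for num, alpha in enumerate('ABCD', start=1):
    plt.subplot(2, 2, num)
    plt.hist(sephist(alpha)[0], bins=25, alpha=0.5, label='yes', color='b')
    plt.hist(sephist(alpha)[1], bins=25, alpha=0.5, label='no', color='r')
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)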

You could make this more generic by:

adding df and by parameters to sephist: def sephist(df, by, col)
making the subplots loop more flexible: for num, alpha in enumerate(df.columns.drop('group'), start=1)
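A sketch of the first suggestion, using the signature proposed above. Returning a dict keyed by group value instead of a (yes, no) tuple is an added assumption, made so the helper works for any number of groups; the sample data is made up.

```python
import numpy as np
import pandas as pd

def sephist(df, by, col):
    """Return {group value: values of `col` for that group}."""
    return {key: values for key, values in df.groupby(by)[col]}

np.random.seed(1)
df = pd.DataFrame(np.random.randn(300, 4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32, 0.68], size=300)

parts = sephist(df, "group", "A")
# Number of rows that ended up in each group.
print({key: len(values) for key, values in parts.items()})
```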

Because the first argument to matplotlib.pyplot.hist can take

either a single array or a sequence of arrays which are not required to be of the same length

...an alternative would be:

# Subplot indices are 1-based, so start counting at 1.
for num, alpha in enumerate('ABCD', start=1):
    plt.subplot(2, 2, num)
    # Colors follow the label order: 'yes' is blue, 'no' is red.
    plt.hist((sephist(alpha)[0], sephist(alpha)[1]), bins=25, alpha=0.5, label=['yes', 'no'], color=['b', 'r'])
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
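To illustrate the sequence-of-arrays form in isolation, here is a minimal sketch with two made-up samples of different lengths (the names small and large are hypothetical). With the default bar style, matplotlib places the datasets side by side within each bin instead of overlaying them, and returns one row of counts per dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
small = rng.normal(size=40)    # 40 samples
large = rng.normal(size=100)   # 100 samples; a different length is fine

fig, ax = plt.subplots()
# Passing a sequence of arrays draws the datasets side by side in each bin.
counts, edges, patches = ax.hist(
    [small, large], bins=10, label=["small", "large"], color=["b", "r"]
)
ax.legend(loc="upper right")

# counts holds one row of bin counts per dataset, on shared bin edges.
print(len(counts), counts[0].sum(), counts[1].sum())
```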

Upvotes: 17

ImportanceOfBeingErnest

Reputation: 339745

Using Seaborn

If you are open to using Seaborn, a plot with multiple subplots and multiple variables within each subplot can easily be made using seaborn.FacetGrid.

import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)

df2 = pd.melt(df, id_vars='group', value_vars=list("ABCD"), value_name='value')

bins=np.linspace(df2.value.min(), df2.value.max(), 10)
g = sns.FacetGrid(df2, col="variable", hue="group", palette="Set1", col_wrap=2)
g.map(plt.hist, 'value', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()
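The pd.melt step is what makes the FacetGrid call possible: it reshapes the wide frame into one row per (column, value) pair, so that "variable" can drive the subplot columns and "group" the hue. A minimal sketch on a tiny made-up frame (the names wide and long_df are illustrative):

```python
import pandas as pd

# Tiny made-up frame: two value columns plus the grouping column.
wide = pd.DataFrame({
    "A": [0.1, 0.2, 0.3],
    "B": [1.1, 1.2, 1.3],
    "group": ["yes", "no", "yes"],
})

# Each (row, value column) pair becomes one row of the long frame.
long_df = pd.melt(wide, id_vars="group", value_vars=["A", "B"], value_name="value")
print(long_df)
```

The result has the columns group, variable and value, with the three rows for "A" followed by the three rows for "B".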

Upvotes: 20
