dreme

Reputation: 4961

Plotting histograms from grouped data in a pandas DataFrame

How do I plot a block of histograms from a group of data in a dataframe? For example, given:

from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter': x, 'N': y})

I tried:

df.groupby('Letter').hist()

...which failed with the error message:

TypeError: cannot concatenate 'str' and 'float' objects

Upvotes: 102

Views: 229847

Answers (6)

dreme

Reputation: 4961

I'm on a roll, just found an even simpler way to do it using the by keyword in the hist method:

df.hist('N', by='Letter')

That's a very handy little shortcut for quickly scanning your grouped data!

For future visitors, the product of this call is the following chart:

[Image: output of the hist plot command]

In answer to questions below, here's an example of specific tailoring of the histogram plots:

# import libraries
import pandas as pd
import numpy as np

# Create test dataframe
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
z = np.random.randn(1000)
df = pd.DataFrame({'Letter':x, 'N1':y, 'N2':z})

# Plot histograms
axes = df.hist(['N1','N2'], by='Letter',bins=10, layout=(2,2),
               legend=True, yrot=90,sharex=True,sharey=True, 
               log=True, figsize=(6,6))
for ax in axes.flatten():
    ax.set_xlabel('N')
    ax.set_ylabel('Count')
    ax.set_ylim(bottom=1,top=100)


Upvotes: 265

Union find

Reputation: 8150

I find this even easier and faster. Note that it plots the distribution of the group sizes, not of the values within each group.

df.groupby('Letter').count()['N'].hist(bins=100)

Upvotes: 2

Gabriele

Reputation: 731

I'm writing this answer because I was looking for a way to plot the histograms of different groups together. What follows is not very smart, but it works fine for me. I use NumPy to compute the histograms and Bokeh for plotting. I think it is self-explanatory, but feel free to ask for clarification and I'll be happy to add details (and write it better).

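import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show

# df_trips is assumed to be a DataFrame with 'locality', 'means' and 'speed' columns
figures = {
    'Transit': figure(title='Transit', x_axis_label='speed [km/h]', y_axis_label='frequency'),
    'Driving': figure(title='Driving', x_axis_label='speed [km/h]', y_axis_label='frequency')
}

cols = {'Vienna': 'red', 'Turin': 'blue', 'Rome': 'orange'}
# Iterating over a groupby yields ((locality, means), sub-DataFrame) pairs
for (locality, means), group in df_trips.groupby(['locality', 'means']):
    fig = figures[means]
    h, b = np.histogram(group.speed.values)
    # Centre each bar on its bin rather than on the bin's right edge
    fig.vbar(x=(b[:-1] + b[1:]) / 2, top=h, width=(b[1]-b[0]), legend_label=locality, fill_color=cols[locality], alpha=0.5)

show(gridplot([
    [figures['Transit']],
    [figures['Driving']],
]))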
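One thing to watch when overlaying groups like this: np.histogram picks separate bin edges for each group unless you pass them in, so the bars from different localities may not line up. A minimal sketch of computing one shared set of edges first, using a made-up `trips` frame as a stand-in for df_trips (the column names and values here are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_trips: two localities with different speed distributions.
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    'locality': ['Vienna'] * 200 + ['Turin'] * 200,
    'speed': np.concatenate([rng.normal(30, 5, 200), rng.normal(40, 8, 200)]),
})

# Compute one set of bin edges from all speeds, then reuse it for every group
# so that the per-group histograms are directly comparable.
edges = np.histogram_bin_edges(trips['speed'], bins=10)
counts = {
    locality: np.histogram(group['speed'], bins=edges)[0]
    for locality, group in trips.groupby('locality')
}

for locality, h in counts.items():
    print(locality, h)
```

The `edges` and `counts[locality]` arrays can then be passed to vbar (or any other bar-drawing call) in place of the per-group `b` and `h`.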

Upvotes: 1

dirkjot

Reputation: 3736

With a recent version of pandas, you can do df.N.hist(by=df.Letter)

Just like with the solutions above, the axes will be different for each subplot. I have not solved that one yet.
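For the differing axes, the sharex and sharey arguments of DataFrame.hist (the same ones used in the accepted answer) seem to be enough. A minimal sketch on the question's toy data; the fixed seed and the (1, 3) layout are my own choices, not part of the original question:

```python
import numpy as np
import pandas as pd

# Same toy data as in the question, with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Letter': ['A'] * 300 + ['B'] * 400 + ['C'] * 300,
    'N': rng.standard_normal(1000),
})

# sharex/sharey ask pandas to create the subplots with common x and y axes.
axes = df.hist(column='N', by='Letter', sharex=True, sharey=True,
               layout=(1, 3), figsize=(9, 3))

# Flatten in case the returned array of Axes is 2-D for other layouts.
flat_axes = np.ravel(axes)
# In a script, call matplotlib.pyplot.show() to display the figure.
```

Each subplot still gets its own bin edges here; passing an explicit array of edges as bins would make the bars line up as well.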

Upvotes: 10

Paul

Reputation: 7325

One solution is to use matplotlib's histogram directly on each group. Iterating over the groupby object yields each group as its own dataframe, so you can create a histogram for each one in a loop.

import matplotlib.pyplot as plt
import numpy as np
from pandas import DataFrame

x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')

# Iterating over a groupby yields (group label, sub-DataFrame) pairs
for name, group in grouped:
    plt.figure()
    plt.hist(group.N)
    plt.title(name)
    plt.show()
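If one window per group is too many, the same loop can draw into the subplots of a single figure instead. This is only a sketch of that variant; the bin count, figure size and shared axes are arbitrary choices of mine:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same toy data as in the question, with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Letter': ['A'] * 300 + ['B'] * 400 + ['C'] * 300,
    'N': rng.standard_normal(1000),
})
grouped = df.groupby('Letter')

# One subplot per group, side by side, with common axes for easy comparison.
fig, axes = plt.subplots(1, grouped.ngroups, figsize=(9, 3),
                         sharex=True, sharey=True)
for ax, (name, group) in zip(axes, grouped):
    ax.hist(group['N'], bins=20)
    ax.set_title(name)

# In a script, call plt.show() to display the figure.
```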

Upvotes: 14

cwharland

Reputation: 6713

Your function is failing because the groupby dataframe you end up with has a hierarchical index and two columns (Letter and N), so when you call .hist() it tries to make a histogram of both columns, hence the str error.

This is the default behavior of pandas plotting functions (one plot per column), so if you reshape your data frame so that each letter is a column, you will get exactly what you want.

df.reset_index().pivot(index='index', columns='Letter', values='N').hist()

The reset_index() is just to shove the current index into a column called index. Then pivot will take your data frame, collect all of the values of N for each Letter and make them a column. The resulting data frame has 1000 rows (one per original row, with NaN wherever a row belongs to a different letter) and three columns (A, B, C). hist() will then produce one histogram per column, and you can format the plots as needed.
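The shape of the reshaped frame is easy to check directly. A small sketch on the question's toy data, with a fixed seed and the intermediate name `wide` added by me for illustration:

```python
import numpy as np
import pandas as pd

# Same toy data as in the question, with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Letter': ['A'] * 300 + ['B'] * 400 + ['C'] * 300,
    'N': rng.standard_normal(1000),
})

# One column per letter; each row keeps its original position in the index.
wide = df.reset_index().pivot(index='index', columns='Letter', values='N')

print(wide.shape)         # (1000, 3)
print(wide.notna().sum()) # non-NaN values per column: A 300, B 400, C 300

# hist() skips the NaN entries and draws one histogram per column.
axes = wide.hist()
```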

Upvotes: 9
