How to plot a histogram with plot.hist for continuous data in a pandas DataFrame?

In this data set I need to plot a histogram with pH, which is continuous, on the x-axis, and with the pH values grouped by their quality value. Most of the resources I found only show solutions that use randomly generated data. I tried this piece of code:

plt.hist(, density=True, bins=1)  
plt.ylabel('quality')
plt.xlabel('pH');

Here I removed the randomly generated data, but I received an error:

File "<ipython-input-16-9afc718b5558>", line 1
    plt.hist(, density=True, bins=1) 
             ^
SyntaxError: invalid syntax

What is the proper way to plot my data? I want to feed the histogram with data from my data set, not with randomly generated data.

Upvotes: 1

Answers (2)

Mr. T

Reputation: 12410

There are several ways to represent multiple histograms. They all require transforming the data from long to wide format, meaning that each category ends up in its own column:

import matplotlib.pyplot as plt
import pandas as pd

#test data generation
import numpy as np
np.random.seed(123)
n=300
df = pd.DataFrame({"A": np.random.randint(1, 100, n), "pH": 3*np.random.rand(n), "quality": np.random.choice([3, 4, 5, 6], n)})
df.pH += df.quality
#instead of this block you have to read here your stored data, e.g.,
#df = pd.read_csv("my_data_file.csv")
#check that it read the correct data
#print(df.dtypes)
#print(df.head(10))


#bringing the columns in the required wide format
plot_df = df.pivot(columns="quality", values="pH")
bin_nr=5

#creating three subplots for different ways to present the same histograms
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6, 12))

ax1.hist(plot_df, bins=bin_nr, density=True, histtype="bar", label=plot_df.columns)
ax1.legend()
ax1.set_title("Basically bar graphs")

plot_df.plot.hist(stacked=True, bins=bin_nr, density=True, ax=ax2)
ax2.set_title("Stacked histograms")

plot_df.plot.hist(alpha=0.5, bins=bin_nr, density=True, ax=ax3)
ax3.set_title("Overlay histograms")
plt.show()

It is not clear, though, what you intended to do with just one bin and why your y-axis was labeled "quality" when this axis represents the frequency in a histogram.
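To make the long-to-wide step concrete, here is a minimal sketch on a made-up six-row frame (the values are invented for illustration and are not taken from your data). Each quality value becomes its own column, and rows that belong to a different quality are filled with NaN, which pandas' plot.hist leaves out when counting.

```python
import pandas as pd

# Made-up long-format data: one row per sample (illustrative values only)
long_df = pd.DataFrame({
    "pH": [3.1, 3.4, 3.3, 3.6, 3.2, 3.5],
    "quality": [5, 6, 5, 7, 6, 5],
})

# Wide format: one column per quality value, NaN where the row has another quality
wide_df = long_df.pivot(columns="quality", values="pH")
print(wide_df)

# Number of non-missing pH values in each quality column
counts = wide_df.count()
print(counts)
```

The wide frame keeps the original row index, so it has as many rows as the long frame; only the non-NaN entries in each column contribute to that column's histogram.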

Upvotes: 2

Marmaduke

Reputation: 591

Your Error

The immediate problem in your code is that plt.hist() is called without its first argument, the data to plot.

plt.hist(, density=True, bins=1)

should be something like:

plt.hist(df['pH'], density=True, bins=1)
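Note that with bins=1 and density=True the result is always a single bar that spans the whole data range, with a height of 1 divided by that range, so it says very little about the distribution. A minimal sketch using np.histogram, which plt.hist relies on for the binning (the pH values below are made up for illustration):

```python
import numpy as np

# Made-up pH values, for illustration only
ph = np.array([3.0, 3.2, 3.3, 3.5, 3.8, 4.0])

# One bin: a single bar covering [min, max]
heights_1, edges_1 = np.histogram(ph, bins=1, density=True)
print(heights_1, edges_1)

# Several bins: the bar areas still sum to 1, but the shape becomes visible
heights_5, edges_5 = np.histogram(ph, bins=5, density=True)
print(heights_5, edges_5)
```

Using more bins (for example bins=10, or bins="auto") is usually what you want for continuous data.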

Seaborn histplot

But this doesn't break the plot down by quality. The answer by Mr. T looks correct, but I'd also suggest seaborn, which works with "melted" (long-format) data like yours. The histplot command should give you what you want:

import seaborn as sns
sns.histplot(data=df, x="pH", hue="quality", palette="Dark2", element='step')

This assumes the table you posted is in a pandas.DataFrame named df with columns "pH" and "quality".

The palette (Dark2) can be any matplotlib colormap.

Subplots

If the overlaid histograms are too hard to see, an option is to do facets or small multiples. To do this with pandas and matplotlib:

import matplotlib.pyplot as plt

# group dataframe by quality values
data_by_qual = df.groupby('quality')

# create a sub plot for each quality group
fig, axes = plt.subplots(nrows=len(data_by_qual), 
                         figsize=[6,12],
                         sharex=True)
fig.subplots_adjust(hspace=.5)

# loop over axes and quality groups together
for ax, (quality, qual_data) in zip(axes, data_by_qual):
    ax.hist(qual_data['pH'], bins=10)
    ax.set_title(f"quality = {quality}")
    ax.set_xlabel('pH')
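One caveat with the loop above: bins=10 is computed separately for each group, so the panels can end up with different bin edges. If you want the subplots to be directly comparable, one option is to compute a common set of edges first and pass it to every hist call. A minimal sketch, assuming df has "pH" and "quality" columns as above (the test data here is randomly generated, and the names rng and shared_edges are just for this example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in for the real data set
rng = np.random.default_rng(123)
n = 300
df = pd.DataFrame({
    "pH": 3 + rng.random(n),
    "quality": rng.choice([3, 4, 5, 6], n),
})

# Bin edges computed once from all pH values, so every panel uses the same bins
shared_edges = np.histogram_bin_edges(df["pH"], bins=10)

data_by_qual = df.groupby("quality")
fig, axes = plt.subplots(nrows=len(data_by_qual), figsize=(6, 12),
                         sharex=True, sharey=True)

for ax, (quality, qual_data) in zip(axes, data_by_qual):
    ax.hist(qual_data["pH"], bins=shared_edges)
    ax.set_title(f"quality = {quality}")

axes[-1].set_xlabel("pH")
fig.tight_layout()
```

Sharing the y-axis as well (sharey=True) makes the counts comparable across panels; leave it out if one group is much larger than the others.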

Altair Facets

The plotting library altair can do this for you:

import altair as alt
alt.Chart(df).mark_bar().encode(
    alt.X("pH:Q", bin=True),
    y='count()',
).facet(row='quality')

Upvotes: 3