Reputation: 189
I have the following dataset:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 6)), columns=['Var_1', 'Var_2', 'Var_3', 'Var_4', 'Var_5', 'Var_6'])
df['Status'] = np.random.randint(0, 2, size=100)
df
Out[1]:
    Var_1  Var_2  Var_3  Var_4  Var_5  Var_6  Status
0      32     65     48     83     60     21       1
1      44     49     65     84     52     34       1
2       9      2      3     14     82     80       1
3      66     90     97     60     28     12       0
4      28     95     64     53     39     30       1
..    ...    ...    ...    ...    ...    ...     ...
95     22      4     43      9     79     46       1
96     10     26     91     59     99     93       0
97     10     31     33     15     99     25       1
98     41     48     80     65     58     18       1
99     39     42     22     56     91     40       1
[100 rows x 7 columns]
How can I create a "stacked" density distribution plot of each variable, categorized by Status (0 or 1)? I would like the plot to look like this:
This plot was created in R. The plot in Python does not have to look exactly the same. What code could I use to accomplish this? Thank you.
Upvotes: 2
Views: 1621
Reputation: 80449
Here is an adaptation of seaborn's ridgeplot example for the given structure. Here multiple='stack' is selected in sns.kdeplot (the default is multiple='layer', which plots both curves starting from y=0). Note that common_norm defaults to True, which scales each curve down in proportion to its share of the samples.
As seaborn works with data in "long form", pd.melt() transforms the given dataframe. The long form looks like:
      Status  variable      value
0          0     Var 1  -0.961877
1          1     Var 1   6.454942
2          0     Var 1   6.020015
3          0     Var 1   7.094057
4          0     Var 1  10.289022
         ...       ...        ...
2995       0     Var 6  -5.718156
2996       0     Var 6  -5.142314
2997       0     Var 6  -5.155104
2998       1     Var 6   3.339401
2999       1     Var 6   7.912669
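To make the wide-to-long step concrete, here is a small sketch on a made-up three-row frame (the values and the names wide and long_df are assumptions for illustration only). Each value column becomes a block of rows, with the former column name stored in "variable" and the id column repeated alongside.

```python
import pandas as pd

# Hypothetical three-row frame with two of the variables
wide = pd.DataFrame({
    "Var 1": [0.5, 1.5, 2.5],
    "Var 2": [10.0, 11.0, 12.0],
    "Status": [0, 1, 0],
})

# Status is kept as an identifier; the Var columns are stacked into variable/value pairs
long_df = wide.melt(id_vars="Status", value_vars=["Var 1", "Var 2"])
print(long_df)
```

The result has one row per (original row, variable) pair, so three rows and two variables give six rows. This is the shape that FacetGrid needs in order to split the data by "variable".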
Here is a full code example:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# Create the data
rs = np.random.RandomState(1979)
data = rs.randn(30, 100).cumsum(axis=1).reshape(-1, 6)
column_names = [f'Var {i}' for i in range(1, 7)]
df = pd.DataFrame(data, columns=column_names)
df['Status'] = rs.randint(0, 2, len(df))
for col in column_names:
    df.loc[df['Status'] == 1, col] += 5
df_long = df.melt(id_vars='Status', value_vars=column_names)
# Initialize the FacetGrid object
g = sns.FacetGrid(data=df_long, row="variable", aspect=6, height=1.8)
# Draw the densities
g.map_dataframe(sns.kdeplot, "value",
                bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=1.5,
                hue="Status", hue_order=[0, 1], palette=['tomato', 'turquoise'], multiple='stack')
g.map(plt.axhline, y=0, lw=2, clip_on=False, color='black')
# Define and use a simple function to label the plot in axes coordinates
def label(x, color):
    ax = plt.gca()
    ax.text(0, .2, x.iloc[0], fontweight="bold", color='black',
            ha="left", va="center", transform=ax.transAxes)
g.map(label, "variable")
# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-.25)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="")
g.despine(bottom=True, left=True)
plt.show()
Upvotes: 1