Ruan

Reputation: 189

Plot "stacked" density distributions of variables, categorized by 0 or 1, in Python

I have the following dataset:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 6)), columns=['Var_1', 'Var_2', 'Var_3', 'Var_4', 'Var_5', 'Var_6'])
df['Status'] = np.random.randint(0, 2, size=100)
df

Out[1]: 
    Var_1  Var_2  Var_3  Var_4  Var_5  Var_6  Status
0      32     65     48     83     60     21       1
1      44     49     65     84     52     34       1
2       9      2      3     14     82     80       1
3      66     90     97     60     28     12       0
4      28     95     64     53     39     30       1
..    ...    ...    ...    ...    ...    ...     ...
95     22      4     43      9     79     46       1
96     10     26     91     59     99     93       0
97     10     31     33     15     99     25       1
98     41     48     80     65     58     18       1
99     39     42     22     56     91     40       1

[100 rows x 7 columns]

How can I create a "stacked" density distribution plot of each variable, categorized by Status (0 or 1)? I would like the plot to look like this:

[Image: target plot, created in R]

This plot was created in R. The plot in Python does not have to look exactly the same. What code could I use to accomplish this? Thank you.

Upvotes: 2

Views: 1621

Answers (1)

JohanC

Reputation: 80449

Here is an adaptation of seaborn's ridgeplot example for the given structure. It passes multiple='stack' to sns.kdeplot, so the curve for one Status value is drawn on top of the other (the default, multiple='layer', draws each curve starting from y=0). Note that common_norm defaults to True, which scales each curve in proportion to its group's number of samples, so that the areas of both curves together sum to 1.

As seaborn works with data in "long form", pd.melt() transforms the given dataframe. The long form looks like:

      Status variable      value
0          0    Var 1  -0.961877
1          1    Var 1   6.454942
2          0    Var 1   6.020015
3          0    Var 1   7.094057
4          0    Var 1  10.289022
      ...      ...        ...
2995       0    Var 6  -5.718156
2996       0    Var 6  -5.142314
2997       0    Var 6  -5.155104
2998       1    Var 6   3.339401
2999       1    Var 6   7.912669
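
To make the reshaping step concrete, here is a small sketch on a three-row frame. The frame wide and its values are made up for illustration; only the melt call mirrors what the full example does.

```python
import pandas as pd

# Tiny made-up wide frame: one column per variable plus the 0/1 Status column
wide = pd.DataFrame({
    "Var 1": [1.0, 2.0, 3.0],
    "Var 2": [4.0, 5.0, 6.0],
    "Status": [0, 1, 0],
})

# Status is kept as an identifier column; every other column is unpivoted
# into a (variable, value) pair, so 3 rows x 2 variables become 6 rows
long = wide.melt(id_vars="Status", value_vars=["Var 1", "Var 2"])
print(long)
```

Each row of the long frame now carries its own Status and variable name, which is what lets FacetGrid split the data by row="variable" and kdeplot split it by hue="Status".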

Here is a full code example:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})

# Create the data
rs = np.random.RandomState(1979)
data = rs.randn(30, 100).cumsum(axis=1).reshape(-1, 6)
column_names = [f'Var {i}' for i in range(1, 7)]
df = pd.DataFrame(data, columns=column_names)
df['Status'] = rs.randint(0, 2, len(df))
for col in column_names:
    df.loc[df['Status'] == 1, col] += 5
df_long = df.melt(id_vars='Status', value_vars=column_names)

# Initialize the FacetGrid object
g = sns.FacetGrid(data=df_long, row="variable", aspect=6, height=1.8)

# Draw the densities
g.map_dataframe(sns.kdeplot, "value",
                bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=1.5,
                hue="Status", hue_order=[0, 1], palette=['tomato', 'turquoise'], multiple='stack')
g.map(plt.axhline, y=0, lw=2, clip_on=False, color='black')

# Define and use a simple function to label the plot in axes coordinates
def label(x, color):
    ax = plt.gca()
    ax.text(0, .2, x.iloc[0], fontweight="bold", color='black',
            ha="left", va="center", transform=ax.transAxes)

g.map(label, "variable")

# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-.25)

# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="")
g.despine(bottom=True, left=True)
plt.show()

[Image: seaborn ridge plot with stacked kdeplots]

Upvotes: 1
