Tclha

Reputation: 105

How to plot a scatter plot with values against a category and colored by a different category

I have a Python Pandas dataframe in the following format:

gender disease1 disease2
male 0.82 0.76
female 0.75 0.93
...... .... ....

I'm looking to plot this in Python (matplotlib, plotly express, etc.) so that it looks something like this:

plot example

How can I restructure my dataframe and/or use a python visualisation library to achieve this result?

Upvotes: 1

Views: 873

Answers (2)

Trenton McKinney

Reputation: 62383

  • The easiest option is to use seaborn.catplot with kind='swarm' or kind='strip'.
    • seaborn is a high-level API for matplotlib
    • seaborn: Plotting with categorical data
    • 'swarm' draws a categorical scatterplot with non-overlapping points, but if there are many points, consider using 'strip'.
  • Reshape the dataframe from a wide to long format with pandas.DataFrame.melt, and then plot.
    • Incidentally, this takes just two lines of code: (1) melt and (2) plot.
  • Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, seaborn 0.11.2
import pandas as pd
import numpy as np  # only for sample data
import seaborn as sns

np.random.seed(365)
rows = 200
data = {'Gender': np.random.choice(['Male', 'Female'], size=rows),
        'Cancer': np.random.rand(rows).round(2),
        'Covid-19': np.random.rand(rows).round(2)}
df = pd.DataFrame(data)

# display(df.head())
   Gender  Cancer  Covid-19
0    Male    0.82      0.88
1    Male    0.02      0.95
2  Female    0.28      0.92
3  Female    0.55      0.28
4    Male    0.15      0.46

# convert to long form
data = df.melt(id_vars='Gender', var_name='Disease')

# display(data.head())
   Gender Disease  value
0    Male  Cancer   0.82
1    Male  Cancer   0.02
2  Female  Cancer   0.28
3  Female  Cancer   0.55
4    Male  Cancer   0.15

# plot
sns.catplot(data=data, x='Disease', y='value', hue='Gender', kind='swarm', palette=['blue', 'pink'], s=4)

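Applied to the column names from the question, the reshape step might look like the sketch below. The two rows are copied from the question; `long_df` and the `disease` column name are illustrative choices, not something the question specifies.

```python
import pandas as pd

# Two sample rows copied from the question
df = pd.DataFrame({'gender': ['male', 'female'],
                   'disease1': [0.82, 0.75],
                   'disease2': [0.76, 0.93]})

# Wide -> long: one row per (gender, disease) pair.
# 'disease' and 'value' are illustrative column names.
long_df = df.melt(id_vars='gender', var_name='disease', value_name='value')
print(long_df)
```

The long frame should then work with the same plotting call, for example `sns.catplot(data=long_df, x='disease', y='value', hue='gender', kind='strip')` if there are too many points for a swarm plot.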

Upvotes: 1

Derek O

Reputation: 19545

You can create a scatter plot in Plotly in which disease1 sits at x=0, disease2 at x=1, and so on for more diseases. Then rename the tick labels, and set each marker's color and horizontal offset according to gender.

The most flexible way to make this plot is to add one trace per slice of the DataFrame, by disease and gender. I added a few more points to your DataFrame to demonstrate that you can keep it in the same format and still achieve the desired plot:

import pandas as pd
import plotly.graph_objects as go

df = pd.DataFrame({'gender':['male','female','male','female'],'disease1':[0.82,0.75,0.60,0.24],'disease2':[0.76,0.93,0.51,0.44]})


fig = go.Figure()
offset = {'male': -0.1, 'female': 0.1}
marker_color_dict = {'male': 'teal', 'female':'pink'}

## set yaxis range
values = df[['disease1','disease2']].values.reshape(-1)
padding = 0.1
fig.update_yaxes(range=[min(values) - padding, 1.0])

for gender in ['male','female']:
    for i, disease in enumerate(['disease1','disease2']):
        ## show each gender once in the legend (on its first trace only)
        showlegend = (i == 0)
        ## y values for this gender and disease
        y_values = df.loc[df['gender'] == gender, disease].values
        fig.add_trace(go.Scatter(
            ## every point shares one x: the disease index shifted by the gender offset
            x=[i + offset[gender]] * len(y_values),
            y=y_values,
            mode='markers',
            marker=dict(color=marker_color_dict[gender], size=20),
            legendgroup=gender,
            name=gender,
            showlegend=showlegend
        ))
fig.update_layout(
    xaxis = dict(
        tickmode = 'array',
        tickvals = [0.0,1.0],
        ticktext = ['disease1','disease2']
    )
)
fig.show()

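To make the positioning logic explicit, here is a minimal sketch of how the x coordinates and tick positions fall out of the disease order and the gender offsets. It uses no plotting library; `diseases`, `x_positions` and `tickvals` are illustrative names, and the offsets are the same ones used in the code above.

```python
# Disease order determines the integer x position of each category
diseases = ['disease1', 'disease2']
# Same horizontal offsets as in the plotting code
offset = {'male': -0.1, 'female': 0.1}

# x coordinate for every (gender, disease) combination
x_positions = {(gender, disease): i + offset[gender]
               for gender in offset
               for i, disease in enumerate(diseases)}

# Tick marks sit at the un-shifted integer positions
tickvals = list(range(len(diseases)))

for key, x in x_positions.items():
    print(key, x)
print(tickvals)
```

Deriving `tickvals` and `ticktext` from the `diseases` list, rather than hard-coding them, is one way to keep the axis labels in sync if more disease columns are added.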

Upvotes: 2
