distracted-biologist
distracted-biologist

Reputation: 808

Excel create scatterplot by categorical variable

I'm trying to create a scatterplot in excel with the following data:

import pandas as pd   
import seaborn as sns   
import numpy as np   

df = pd.DataFrame(data=np.random.random((100, 2)), columns=['X', 'Y'])
df['category'] = np.random.choice(['groupA', 'groupB', 'groupC'], size=len(df))

df.head()

    X           Y           category
0   0.407807    0.269535    groupB
1   0.658499    0.671735    groupB
2   0.071938    0.199920    groupC
3   0.197185    0.358241    groupA
4   0.033922    0.917605    groupB

In seaborn I would just do something like:

sns.scatterplot(x='X', y='Y', data=df, hue='category')

enter image description here

But I am unable to create this is in excel. Do I really have to create a new column for each category?

Upvotes: 0

Views: 1482

Answers (1)

Not sure if there is a better way to do this in Excel but you need to resume your data using a Pivot Table, and then copy/paste the output as range somewhere else and then do your scatterplot (You can't do an scatterplot directly taking a Pivot Table as a source):

enter image description here

The setup of my Pivot Table is:

enter image description here

  1. Field X to rows section
  2. Field Category to Columns section
  3. Field Y to values section, set it to Sum

Anyways this method got an issue: If there are 2 or more points in same category with same X value but different Y values, then Pivot Table will sum them up and that's an error.

What you need to create the scatterlot is a source with the structure showed in step 3. It there is any kind od semiduplicates as mentioned before, I'm afraid you'll need to construct manually (or VBA code) your dataframe with this structure:

enter image description here

Do I really have to create a new column for each category? I'm afraid that yes, you do

Upvotes: 1

Related Questions