Reputation: 1063
I'd like to make a scatter plot from a DataFrame in which each point is colored according to how often its value occurs. As an example, I have the following DataFrame, built from two lists of numeric values:
df = pd.DataFrame({'width': image_widths, 'height': image_heights})
df.head(10)
height width
0 1093 640
1 1136 639
2 1095 640
3 1136 639
4 1095 640
5 1100 640
6 1136 640
7 1136 639
8 1136 640
9 1031 640
As you can see, some value pairs occur multiple times. For example, (1095/640) occurs at indices 2 and 4. How do I give this dot a color representing "two occurrences"? It would be even better if the color were picked automatically from a continuous spectrum, as in a colorbar plot, so that the shade alone gives an impression of the frequency without having to look up manually what each color represents.
As an alternative to coloring, I would also appreciate having the frequency of occurrences encoded as the radius of the dots.
EDIT:
To narrow down my question: I figured out that df.groupby(['width','height']).size() gives me the count of each combination.
What I lack is the skill to link this information to the color (or size) of the dots in the plot.
Upvotes: 0
Views: 1989
Reputation: 12590
So let's make this a true Minimal, Complete, and Verifiable example:
import matplotlib.pyplot as plt
import pandas as pd
image_heights = [1093, 1136, 1095, 1136, 1095, 1100, 1136, 1136, 1136, 1031]
image_widths = [640, 639, 640, 639, 640, 640, 640, 639, 640, 640]
df = pd.DataFrame({'width': image_widths, 'height': image_heights})
print(df)
width height
0 640 1093
1 639 1136
2 640 1095
3 639 1136
4 640 1095
5 640 1100
6 640 1136
7 639 1136
8 640 1136
9 640 1031
You want the sizes (counts) along with the widths and heights in a DataFrame:
plot_df = df.groupby(['width','height']).size().reset_index(name='count')
print(plot_df)
width height count
0 639 1136 3
1 640 1031 1
2 640 1093 1
3 640 1095 2
4 640 1100 1
5 640 1136 2
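If you would rather keep one row per image instead of one row per unique pair, a possible variant is to broadcast the group size back onto the original rows. This is only a sketch under that assumption; the name df_with_count is made up for illustration and is not part of the question.

```python
import pandas as pd

image_heights = [1093, 1136, 1095, 1136, 1095, 1100, 1136, 1136, 1136, 1031]
image_widths = [640, 639, 640, 639, 640, 640, 640, 639, 640, 640]
df = pd.DataFrame({'width': image_widths, 'height': image_heights})

# transform('size') returns the size of each (width, height) group,
# aligned to the original index, so every row keeps its own count.
# df_with_count is a hypothetical name used only in this sketch.
df_with_count = df.assign(
    count=df.groupby(['width', 'height'])['width'].transform('size')
)
print(df_with_count)
```

Duplicate pairs then stay in the frame, so a scatter plot of it draws identical points on top of each other. The aggregated plot_df above avoids that overdraw.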
The colors and sizes in a scatter plot are controlled by the c and s keywords if you use DataFrame.plot.scatter:
plot_df.plot.scatter(x='height', y='width', s=10 * plot_df['count']**2,
c='count', cmap='viridis')
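For reference, roughly the same plot can be sketched with matplotlib directly, which makes the colorbar explicit. This assumes you are fine with writing the figure to a file; the file name scatter_counts.png and the Agg backend line are my own choices for a self-contained script, not something from the question.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; drop this line for an interactive window
import matplotlib.pyplot as plt
import pandas as pd

image_heights = [1093, 1136, 1095, 1136, 1095, 1100, 1136, 1136, 1136, 1031]
image_widths = [640, 639, 640, 639, 640, 640, 640, 639, 640, 640]
df = pd.DataFrame({'width': image_widths, 'height': image_heights})

# One row per unique (width, height) pair plus its number of occurrences
plot_df = df.groupby(['width', 'height']).size().reset_index(name='count')

fig, ax = plt.subplots()
# c maps each count onto the colormap; s sets the marker area in points**2
points = ax.scatter(plot_df['height'], plot_df['width'],
                    s=10 * plot_df['count'] ** 2,
                    c=plot_df['count'], cmap='viridis')
fig.colorbar(points, ax=ax, label='count')
ax.set_xlabel('height')
ax.set_ylabel('width')
fig.savefig('scatter_counts.png')  # hypothetical output file name
```

Note that s is an area rather than a radius, so squaring the count as above makes the radius grow roughly linearly with the number of occurrences.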
Upvotes: 4