user14744497

Reputation: 33

Colour code the plot based on the two data frame values

I would like to colour-code a scatter plot based on two data frame columns: each distinct value of df[1] should get its own colour, and within each group of rows sharing the same df[1] value, the opacity of that colour should vary with df[2], so that the highest df[2] value in the group is 100 % opaque and the lowest is the least opaque.

Here is the code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func():
    ...

df = pd.read_csv(PATH + file, sep=",", header=None)

# initial guesses for the fit parameters
b = 2.72
a = 0.00000009

popt, pcov = curve_fit(func, df[2], df[5]/df[4], p0=[a,b])

perr = np.sqrt(np.diag(pcov))

plt.scatter(df[1], df[5]/df[4]/df[2], label="data")
# Plot responsible for the datapoints in the figure

plt.plot(df[1], func(df[2], *popt)/df[2], "r", label="fit")
# plot responsible for the curve in the figure

plt.legend(loc="upper left")

Here is the sample dataset:

**df[0],df[1],df[2],df[3],df[4],df[5],df[6]**

file_name_1_i1,31,413,36120,10,9,10
file_name_1_i2,31,1240,60488,10,25,27
file_name_1_i3,31,2769,107296,10,47,48
file_name_1_i4,31,8797,307016,10,150,150
file_name_2_i1,34,72,10868,11,9,10
file_name_2_i2,34,6273,250852,11,187,196
file_name_3_i1,36,84,29568,12,9,10
file_name_3_i2,36,969,68892,12,25,26
file_name_3_i3,36,6545,328052,12,150,151
file_name_4_i1,69,116,40712,13,25,26
file_name_4_i2,69,417,80080,13,47,48
file_name_4_i2,69,1313,189656,13,149,150
file_name_4_i4,69,3009,398820,13,195,196
file_name_4_i5,69,22913,2855044,13,3991,4144
file_name_5_i1,85,59,48636,16,47,48
file_name_5_i2,85,163,64888,15,77,77
file_name_5_i3,85,349,108728,16,103,111
file_name_5_i4,85,1063,253180,14,248,248
file_name_5_i5,85,2393,526164,15,687,689
file_name_5_i6,85,17713,3643728,15,5862,5867
file_name_6_i1,104,84,75044,33,137,138
file_name_6_i2,104,455,204792,28,538,598
file_name_6_i3,104,1330,513336,31,2062,2063
file_name_6_i4,104,2925,1072276,28,3233,3236
file_name_6_i5,104,6545,2340416,28,7056,7059
...

So, the x-axis would be df[1], which is 31, 31, 31, 31, 34, 34, ..., and the y-axis is df[5], df[4], df[2], which are 9, 10, 413. Each distinct value of df[1] needs a new colour. It would be fine to repeat the colour cycle after, say, 6 unique colours. Within each colour, the opacity needs to change with respect to the value of df[2] (though the y-axis is df[5], df[4], df[2]). The highest value gets the darkest version of the colour, and the lowest gets the lightest version.

and the scatter plot:

[Image: scatter plot]

This is roughly how my desired solution of the color code needs to look like:

[Image: desired colour code]

I have around 200 entries in the csv file.

Would using NumPy be more advantageous in this scenario?

Upvotes: 3

Views: 611

Answers (2)

Mr. T

Reputation: 12410

Well, what do you know. I understood this task totally differently. I thought the point was to have alpha levels according to all df[2], df[4], and df[5] values for each df[1] value. Oh well, since I have done the work already, why not post it?

from matplotlib import pyplot as plt
import pandas as pd
from itertools import cycle
from matplotlib.colors import to_rgb

#read the data, column numbers will be generated automatically
df = pd.read_csv("data.txt", sep = ",", header=None)

#our figure with the ax object
fig, ax = plt.subplots(figsize=(10,10))
#definition of the colors
sc_color = cycle(["tab:orange", "red", "blue", "black"])

#get groups of the same df[1] value, they will also be sorted at the same time
dfgroups = df.iloc[:, [2, 4, 5]].groupby(by=df[1])

#plot each group with a different colour
for groupkey, groupval in dfgroups:
    #create group dataframe with df[1] value as x and df[2], df[4], and df[5] values as y
    groupval= groupval.melt(var_name="x", value_name="y")
    groupval.x = groupkey
    
    #get  min and max y for the normalization
    y_high = groupval.y.max()
    y_low = groupval.y.min()
    #read out r, g, and b values of the next color in the cycle
    r, g, b = to_rgb(next(sc_color))
    #create a colour array with nonlinear normalized alpha levels
    #between 0.19 and 0.99, so that all data points are visible
    #(the largest y in a group gets the lowest alpha)
    group_color = [(r, g, b, 0.19 + 0.8 * ((y_high-val) / (y_high-y_low))**7) for val in groupval.y]
    #and plot
    ax.scatter(groupval.x, groupval.y, c=group_color)
    
    
plt.show()

Sample output of your data:

[Image: sample output]

There are two main problems here. One is that alpha in a scatter plot does not accept an array (this holds for Matplotlib versions before 3.4). But color does; hence the detour of reading out the RGB values and creating an RGBA array with the alpha levels added.
The other is that your data are spread over a rather wide range, so a linear normalization makes changes near the lowest values invisible. There is surely some optimization possible; I like, for instance, this suggestion.
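As a rough sketch of the RGBA detour with a different normalization: the helper below (the name `rgba_with_scaled_alpha` and its parameters are made up for illustration) uses a log scale instead of the power law in the code above, and makes the largest value the most opaque, as the question asks. It assumes strictly positive values.

```python
import numpy as np

def rgba_with_scaled_alpha(rgb, values, alpha_min=0.2, alpha_max=1.0):
    """Return an (n, 4) RGBA array: one base colour, alpha rising with `values`."""
    values = np.asarray(values, dtype=float)
    # Log scaling spreads out small values when the data span several
    # orders of magnitude (assumption: all values are > 0).
    scaled = np.log10(values)
    span = scaled.max() - scaled.min()
    # A group with a single distinct value is drawn fully opaque.
    frac = np.ones_like(scaled) if span == 0 else (scaled - scaled.min()) / span
    rgba = np.empty((values.size, 4))
    rgba[:, :3] = rgb  # same base colour for every point
    rgba[:, 3] = alpha_min + (alpha_max - alpha_min) * frac
    return rgba

# df[2] values of the df[1] == 31 group from the question's sample data
colors = rgba_with_scaled_alpha((1.0, 0.0, 0.0), [413, 1240, 2769, 8797])
```

The resulting array can be passed as `c=colors` to `ax.scatter`, one call per group, with the RGB triple coming from `to_rgb(next(sc_color))` as in the loop above.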

Upvotes: 0

sai

Reputation: 1784

Let me know if this is appropriate or if I have misunderstood anything:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# not needed for you
# df = pd.read_csv('~/Documents/tmp.csv')

max_2 = pd.DataFrame(df.groupby('1').max()['2'])

no_unique_colors = 3
color_set = [np.random.random((3)) for _ in range(no_unique_colors)]
# assign colors to the unique df1 groups in cyclic order
max_2['colors'] = [color_set[group_idx % no_unique_colors] for group_idx in range(max_2.shape[0])]

# calculate the opacities for each entry in the dataframe
colors = [list(max_2.loc[df1].colors) + [float(df['2'].iloc[i])/max_2['2'].loc[df1]] for i, df1 in enumerate(df['1'])]
# repeat thrice so that df2, df4 and df5 share the same opacity
colors = [x for x in colors for _ in range(3)]

plt.scatter(df['1'].values.repeat(3), df[['2', '4', '5']].values.reshape(-1), c=colors)
plt.show()
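On the NumPy part of the question: the per-row list comprehension can also be vectorised. This is only a sketch under a few assumptions: the `palette` is a hypothetical two-colour example, the arrays hold a handful of (df[1], df[2]) pairs copied from the sample data, and all df[2] values are positive.

```python
import numpy as np

# A few (df[1], df[2]) pairs from the question's sample data
group = np.array([31, 31, 31, 31, 34, 34, 36, 36, 36])
value = np.array([413, 1240, 2769, 8797, 72, 6273, 84, 969, 6545], dtype=float)

# Hypothetical palette of RGB triples; it repeats once the groups outnumber it
palette = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

# inverse[i] is the index of row i's group among the sorted unique df[1] values
_, inverse = np.unique(group, return_inverse=True)

# Per-group maximum of df[2], then looked up again for every row
group_max = np.zeros(inverse.max() + 1)
np.maximum.at(group_max, inverse, value)
alpha = value / group_max[inverse]

# One RGBA row per data point: cycled base colour plus value/max opacity
rgba = np.column_stack([palette[inverse % len(palette)], alpha])
```

`rgba` can then be passed as `c=rgba` in a single `plt.scatter` call. For about 200 rows the speed difference is unlikely to matter, so this is mostly a question of readability.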


Upvotes: 1
