O'Neil Murray
O'Neil Murray

Reputation: 39

Python: Barplot colored according to a third variable

Currently I am trying to create a Barplot that shows the amount of reviews for an app per week. The bar should however be colored according to a third variable which contains the average rating of the reviews in each week (range: 1 to 5).

I followed the instructions of the following post to create the graph: Python: Barplot with colorbar

The code works fine:

# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable 

# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]] 
df = pd.DataFrame(data, columns = ["week", "count", "score"])

# Convert to lists
data_x = list(df["week"])
data_hight = list(df["count"])
data_color = list(df["score"])

#Create Barplot:
data_color = [x / max(data_color) for x in data_color]
fig, ax = plt.subplots(figsize=(15, 4))

my_cmap = plt.cm.get_cmap('RdYlGn')
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)

sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(1,5))
sm.set_array([])

cbar = plt.colorbar(sm)
cbar.set_label('Color', rotation=270,labelpad=25)

plt.show()

Barplot

Now to the issue: As you might notice the value of the average score in week 4 is "1.2". The Barplot does however indicate that the value lies around "2.5". I understand that this stems from the following code line, which standardizes the values by dividing it with the max value:

data_color = [x / max(data_color) for x in data_color]

Unfortunatly I am not able to change this command in a way that the colors resemble the absolute values of the scores, e.g. with a average score of 1.2 the last bar should be colored in deep red not light orange. I tried to just plug in the regular score values (Not standardized) to solve the issue, however, doing so creates all bars with the same green color... Since this is only my second python project, I have a hard time comprehending the process behind this matter and would be very thankful for any advice or solution.

Cheers Neil

Upvotes: 1

Views: 1532

Answers (1)

Mr. T
Mr. T

Reputation: 12410

You identified correctly that the normalization is the problem here. It is in the linked code by valued SO user @ImportanceOfBeingEarnest defined for the interval [0, 1]. If you want another normalization range [normmin, normmax], you have to take this into account during the normalization:

# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable 

# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]] 
df = pd.DataFrame(data, columns = ["week", "mycount", "score"])
  
# Not necessary to convert to lists, pandas series or numpy array is also fine
data_x = df.week
data_hight = df.mycount
data_color = df.score

#Create Barplot:
normmin=1
normmax=5
data_color = [(x-normmin) / (normmax-normmin) for x in data_color] #see the difference here
fig, ax = plt.subplots(figsize=(15, 4))

my_cmap = plt.cm.get_cmap('RdYlGn')
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)

sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(normmin,normmax))
sm.set_array([])

cbar = plt.colorbar(sm)
cbar.set_label('Color', rotation=270,labelpad=25)

plt.show()

Sample output:

enter image description here

Obviously, this does not check that all values are indeed within the range [normmin, normmax], so a better script would make sure that all values adhere to this specification. We could, alternatively, address this problem by clipping the values that are outside the normalization range:

#...
import numpy as np
#.....
#Create Barplot:
normmin=1
normmax=3.5

data_color = [(x-normmin) / (normmax-normmin) for x in np.clip(data_color, normmin, normmax)]
#....

You may also have noticed another change that I introduced. You don't have to provide lists - pandas series or numpy arrays are fine, too. And if you name your columns not like pandas functions such as count, you can access them as df.ABC instead of df["ABC"].

Upvotes: 2

Related Questions