Reputation: 301
I'm using Python 3.8, Plotly 4.14.1, and pandas 1.2.0.
I can't work out how to separate my data in pandas and assign data to counters so I can update a heatmap.
I want to create a risk matrix of impact x likelihood and have those numbers shown on a Plotly heat map.
Hard-coding the data into the dataframe works as expected, as below with Figure Factory:
import plotly.figure_factory as ff

gross_data = [[0, 1, 2, 6, 3], [0, 7, 18, 12, 6], [6, 10, 43, 44, 7], [3, 15, 29, 46, 18], [5, 14, 26, 22, 21]]
x = ['Minor [Very Low]<br>1', 'Important [Low]<br>2', 'Significant [Moderate]<br>3', 'Major [High]<br>4', 'Critical [Very High]<br>5']
y = ['Very Remote<br>1', 'Remote<br>2', 'Unlikely<br>3', 'Possible<br>4', 'Highly Possible<br>5']

fig = ff.create_annotated_heatmap(gross_data, x=x, y=y, colorscale='Magma')
fig['layout']['xaxis']['side'] = 'bottom'
fig.show()
Plotly Figure Factory annotated heatmap
or doing it with Plotly Express:
import plotly.express as px

gross_data = [[0, 1, 2, 6, 3], [0, 7, 18, 12, 6], [6, 10, 43, 44, 7], [3, 15, 29, 46, 18], [5, 14, 26, 22, 21]]

fig = px.imshow(gross_data,
                labels=dict(x="Impact", y="Probability", color="Number of Risks"),
                x=['Minor [Very Low]', 'Important [Low]', 'Significant [Moderate]', 'Major [High]', 'Critical [Very High]'],
                y=['Very Remote', 'Remote', 'Unlikely', 'Possible', 'Highly Possible']
                )
fig.update_xaxes(side="bottom")
fig.show()
Now, these would both do for a one-off presentation, but I'd like to be able to slice and dice my data and display it by business unit, etc.
I am using pandas to read in about 1000 lines of Excel data, filled out as a form by humans, complete with all the errors that brings. I have tidied up and cleaned the data.
What I can't work out is how to get the counts needed for each of the cells in the risk matrix without having a massive if statement, something like the following:
for x in df['gross_impact']:
    for y in df['gross_likelihood']:
        if (x == 1) & (y == 1):
            # increment a counter associated with cell x1y1
        elif (x == 2) & (y == 1):
            # increment a counter associated with cell x2y1
        elif (x == 3) & (y == 1):
            # increment a counter associated with cell x3y1
        elif (x == 4) & (y == 1):
            # increment a counter associated with cell x4y1
        elif (x == 5) & (y == 1):
            # increment a counter associated with cell x5y1
        elif (x == 1) & (y == 2):
            # increment a counter associated with cell x1y2
        elif (x == 2) & (y == 2):
            # increment a counter associated with cell x2y2
        .....
        .....
        elif (x == 5) & (y == 5):
            # increment a counter associated with cell x5y5
This is obviously wholly inefficient: I'd run through about 233,000 loops to get my results, and while compute resource is cheap, that doesn't make it right.
I just can't fathom how I should do this. I have read a couple of Stack Exchange questions, but they did not address my query. I've searched around the net for risk matrices, but the results come back as C++ or financial material that does not look applicable.
My data frame has about 30 fields, but the unique reference is risk_id. The fields I am multiplying against each other are gross_impact and gross_likelihood, and I could sort on risk_id and business_unit.
I have to differentiate between 1 x 3 and 3 x 1, as they go into two different bins. Counting should be from bottom left to top right, so 1 x 3 is column 1, row 3, and 3 x 1 is column 3, row 1.
Hopefully that is clear; if I knew exactly what to ask for, I might have found it already. Any help with working this out would be greatly appreciated.
Here is an example of the data I have in my dataframe:
import pandas as pd
# Our data to be loaded into the data frame
data = {'risk_id': ['AP-P01-R01', 'AP-P02-R02', 'AP-P03-R03', 'AP-P01-R04', 'BP-P01-R01', 'BP-P01-R02', 'BP-P01-R03', 'BP-P01-R04', 'BP-P01-R05', 'BP-P01-R06', 'BP-P01-R07', 'CP-P01-R01', 'CP-P01-R02', 'CP-P01-R03', 'CP-P01-R04', 'CP-P01-R05', 'CP-P01-R06', 'CP-P01-R07', 'CP-P01-R08'],
'gross_impact': [4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 3, 3, 4, 4, 2, 3, 5, 3, 2],
'gross_likelihood': [3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 2, 3, 3, 3, 3, 4, 4, 5, 3],
'business_unit': ['Accounts Payable', 'Accounts Payable', 'Accounts Payable', 'Accounts Payable', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'British Petroleum', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile', 'Client Profile']
}
This all loads in fine
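For reference, loading a (truncated) version of the dict above and checking dtypes up front is a quick sanity check; a minimal sketch:

```python
import pandas as pd

# truncated sample of the data dict above
data = {'risk_id': ['AP-P01-R01', 'AP-P02-R02', 'BP-P01-R01', 'CP-P01-R01'],
        'gross_impact': [4, 4, 4, 3],
        'gross_likelihood': [3, 3, 3, 3],
        'business_unit': ['Accounts Payable', 'Accounts Payable',
                          'British Petroleum', 'Client Profile']}

df = pd.DataFrame(data)
print(df.dtypes)   # impact/likelihood load as integer columns here
```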
The solution provided worked perfectly on my limited dataset. When I created the same pandas dataframe from my master file, I got the following error:
IndexError: arrays used as indices must be of integer (or boolean) type
The error appears when I run the code; the following line throws it:
heatmap[counts[:,1]-1, counts[:,0]-1] = counts[:,2]
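The error is easy to reproduce in isolation: NumPy refuses float arrays as indices, which is exactly what happens when the groupby columns are float64. A minimal sketch:

```python
import numpy as np

a = np.zeros((3, 3))
idx = np.array([0.0, 2.0])   # float64 indices, e.g. from a column that held NaNs

err = None
try:
    a[idx, idx] = 1          # NumPy rejects float arrays as indices
except IndexError as e:
    err = e
print(err)

a[idx.astype(int), idx.astype(int)] = 1   # cast to int first and it works
```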
the full code I am running is
import numpy as np

df = raca_df[['risk_id', 'gross_impact', 'gross_likelihood', 'business_unit']].dropna()
counts = df.groupby(['gross_impact', 'gross_likelihood']).apply(len).reset_index().values
heatmap = np.zeros((5, 5))
#heatmap = np.zeros((np.max(df['gross_impact']), np.max(df['gross_likelihood'])))
heatmap[counts[:,1]-1, counts[:,0]-1] = counts[:,2]
import plotly.figure_factory as ff
fig = ff.create_annotated_heatmap(heatmap)
fig['layout']['xaxis']['side'] = 'bottom'
fig.show()
After being prompted about how I created raca_df, a quick raca_df.info() pointed out my issue. It was a problem with my original raca_df dataframe: the column type was float64. I also had some blank entries in the column, so it would not let me change the column type there.
I had to create a new data frame from raca_df, imaginatively called df, and I changed the column types there using a loop and astype(int):
df = raca_df[['risk_id', 'gross_impact', 'gross_likelihood', 'business_unit']].dropna()
for item in ['gross_impact', 'gross_likelihood']:   # plus the net_* columns in the full dataset
    df[item] = df[item].astype(int)
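A minimal illustration of why the cast only works after the blanks are dropped: a single NaN forces the whole column to float64, and the NaN itself cannot be cast to int.

```python
import pandas as pd
import numpy as np

s = pd.Series([4, np.nan, 3])    # one blank entry forces the column to float64
print(s.dtype)                   # float64

clean = s.dropna().astype(int)   # drop the blanks first, then the cast succeeds
print(clean.dtype)               # an integer dtype
```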
Upvotes: 1
Views: 2867
Reputation: 9481
You can group by impact and likelihood and use the group size to get your heatmap intensities:
counts = df.groupby(['gross_impact','gross_likelihood']).apply(len).reset_index().values
# then, you can initialise an empty heatmap and use fancy indexing to populate it:
import numpy as np
heatmap = np.zeros((np.max(df['gross_impact']), np.max(df['gross_likelihood'])))
heatmap[counts[:,0]-1, counts[:,1]-1] = counts[:,2]
# and finally plot it
import plotly.figure_factory as ff
fig = ff.create_annotated_heatmap(heatmap)
fig.show()
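An equivalent loop-free way to get the same counts is pd.crosstab, reindexed so unused cells on the full 1–5 grid show as 0 (a sketch with made-up sample values):

```python
import pandas as pd

df = pd.DataFrame({'gross_impact':     [4, 4, 3, 5],
                   'gross_likelihood': [3, 3, 5, 4]})

# rows = likelihood, columns = impact; reindex fills empty cells with 0
heatmap = (pd.crosstab(df['gross_likelihood'], df['gross_impact'])
             .reindex(index=range(1, 6), columns=range(1, 6), fill_value=0))
print(heatmap.loc[3, 4])   # two risks with impact 4, likelihood 3
```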
edit:
for individual business units, you can run the same code, with either of these tweaks:
# for a single business unit
counts = df[df['business_unit']=='Accounts Payable'].groupby(['gross_impact','gross_likelihood']).apply(len).reset_index().values
# and then the remaining code
# for all units, you can loop:
for business_unit in set(df['business_unit']):
    counts = df[df['business_unit'] == business_unit].groupby(['gross_impact','gross_likelihood']).apply(len).reset_index().values
    # and then the remaining code
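Instead of set() plus repeated boolean filtering, df.groupby('business_unit') yields each unit's sub-frame in a single pass; a sketch with made-up sample values:

```python
import pandas as pd

df = pd.DataFrame({'business_unit':    ['Accounts Payable', 'Accounts Payable', 'Client Profile'],
                   'gross_impact':     [4, 3, 5],
                   'gross_likelihood': [3, 5, 4]})

# one (impact, likelihood) count table per unit
per_unit = {unit: sub.groupby(['gross_impact', 'gross_likelihood']).size()
            for unit, sub in df.groupby('business_unit')}
print(per_unit['Accounts Payable'].to_dict())
```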
Upvotes: 1