Reorder Sankey diagram vertically based on label value

Question

I'm trying to plot patient flows between 3 clusters in a Sankey diagram. I have a pd.DataFrame counts with from-to values, see below. To reproduce this DF, here is the counts dict that should be loaded into a pd.DataFrame (which is the input for the visualize_cluster_flow_counts function).

    from    to      value
0   C1_1    C1_2    867
1   C1_1    C2_2    405
2   C1_1    C0_2    2
3   C2_1    C1_2    46
4   C2_1    C2_2    458
... ... ... ...
175 C0_20   C0_21   130
176 C0_20   C2_21   1
177 C2_20   C1_21   12
178 C2_20   C0_21   0
179 C2_20   C2_21   96

The from and to values in the DataFrame represent the cluster number (either 0, 1, or 2) and the amount of days for the x-axis (between 1 and 21). If I plot the Sankey diagram with these values, this is the result:

Code:

import plotly.graph_objects as go

def visualize_cluster_flow_counts(counts):
    all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))            
    
    froms, tos, vals, labs = [], [], [], []
    for index, row in counts.iterrows():
        froms.append(all_sources.index(row.values[0]))
        tos.append(all_sources.index(row.values[1]))
        vals.append(row[2])
        labs.append(row[3])
                
    fig = go.Figure(data=[go.Sankey(
        arrangement='snap',
        node = dict(
          pad = 15,
          thickness = 5,
          line = dict(color = "black", width = 0.1),
          label = all_sources,
          color = "blue"
        ),
        link = dict(
          source = froms,
          target = tos,
          value = vals,
          label = labs
      ))])

    fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
    fig.show()

visualize_cluster_flow_counts(counts)

However, I would like to vertically order the bars so that the C0's are always on top, the C1's are always in the middle, and the C2's are always at the bottom (or the other way around, doesn't matter). I know that we can set node.x and node.y to manually assign the coordinates. So, I set the x-values to the amount of days * (1/range of days), which is an increment of +- 0.045. And I set the y-values based on the cluster value: either 0, 0.5 or 1. I then obtain the image below. The vertical order is good, but the vertical margins between the bars are obviously way off; they should be similar to the first result.

The code to produce this is:

import plotly.graph_objects as go

def find_node_coordinates(sources):
    x_nodes, y_nodes = [], []
    
    for s in sources:
        # Shift each x with +- 0.045
        x = float(s.split("_")[-1]) * (1/21)
        x_nodes.append(x)
        
        # Choose either 0, 0.5 or 1 for the y-value
        cluster_number = s[1]
        if cluster_number == "0": y = 1
        elif cluster_number == "1": y = 0.5
        else: y = 1e-09
        
        y_nodes.append(y)
                
    return x_nodes, y_nodes


def visualize_cluster_flow_counts(counts):
    all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))    
        
    node_x, node_y = find_node_coordinates(all_sources)
    
    froms, tos, vals, labs = [], [], [], []
    for index, row in counts.iterrows():
        froms.append(all_sources.index(row.values[0]))
        tos.append(all_sources.index(row.values[1]))
        vals.append(row[2])
        labs.append(row[3])
                
    fig = go.Figure(data=[go.Sankey(
        arrangement='snap',
        node = dict(
          pad = 15,
          thickness = 5,
          line = dict(color = "black", width = 0.1),
          label = all_sources,
          color = "blue",
          x = node_x,
          y = node_y,
        ),
        link = dict(
          source = froms,
          target = tos,
          value = vals,
          label = labs
      ))])

    fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
    fig.show()
    
    
visualize_cluster_flow_counts(counts)

Question: how do I fix the margins of the bars, so that the result looks like the first result? So, for clarity: the bars should be pushed to the bottom. Or is there another way that the Sankey diagram can vertically re-order the bars automatically based on the label value?

Parano · Accepted Answer

Firstly I don't think there is a way with the current exposed API to achieve your goal smoothly you can check the source code here.

Try to change your find_node_coordinates function as follows (note that you should pass the counts DataFrame to):

counts = pd.DataFrame(counts_dict) 
def find_node_coordinates(sources, counts):
    x_nodes, y_nodes = [], []

    flat_on_top = False
    range = 1 # The y range
    total_margin_width = 0.15
    y_range = 1 - total_margin_width 
    margin = total_margin_width / 2 # From number of  Cs
    srcs = counts['from'].values.tolist()
    dsts = counts['to'].values.tolist() 
    values = counts['value'].values.tolist() 
    max_acc = 0

    def _calc_day_flux(d=1):
        _max_acc = 0 
        for i in [0,1,2]:
            # The first ones
            from_source = 'C{}_{}'.format(i,d) 
            indices = [i for i, val in enumerate(srcs) if val == from_source]
            for j in indices: 
                _max_acc += values[j]
        
        return _max_acc

    def _calc_node_io_flux(node_str): 
        c,d = int(node_str.split('_')[0][-1]), int(node_str.split('_')[1])
        _flux_src = 0 
        _flux_dst = 0 

        indices_src = [i for i, val in enumerate(srcs) if val == node_str]
        indices_dst = [j for j, val in enumerate(dsts) if val == node_str]
        for j in indices_src: 
            _flux_src += values[j]
        for j in indices_dst: 
            _flux_dst += values[j]

        return max(_flux_dst, _flux_src) 

    max_acc = _calc_day_flux() 
    graph_unit_per_val = y_range / max_acc
    print("Graph Unit per Acc Val", graph_unit_per_val) 
 
    
    for s in sources:
        # Shift each x with +- 0.045
        d = int(s.split("_")[-1])
        x = float(d) * (1/21)
        x_nodes.append(x)
        
        print(s, _calc_node_io_flux(s))
        # Choose either 0, 0.5 or 1 for the y-v alue
        cluster_number = s[1]

        
        # Flat on Top
        if flat_on_top: 
            if cluster_number == "0": 
              y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val + margin +  _calc_node_io_flux('C{}_{}'.format(0, d))*graph_unit_per_val/2
            elif cluster_number == "1": y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin +  _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val/2
            else: y = 1e-09
        # Flat On Bottom
        else: 
            if cluster_number == "0": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val / 2)
            elif cluster_number == "1": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val /2 )
            elif cluster_number == "2": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(2,d)) * graph_unit_per_val /2 )
            
        y_nodes.append(y)
                
    return x_nodes, y_nodes

Sankey graphs supposed to weigh their connection width by their corresponding normalized values right? Here I do the same, first, it calculates each node flux, later by calculating the normalized coordinate the center of each node calculated according to their flux.

Here is the sample output of your code with the modified function, note that I tried to adhere to your code as much as possible so it's a bit unoptimized(for example, one could store the values of nodes above each specified source node to avoid its flux recalculation).

With flag flat_on_top = True

With flag flat_on_top = False

There is a bit of inconsistency in the flat_on_bottom version which I think is caused by the padding or other internal sources of Plotly API.

Reorder Sankey diagram vertically based on label value

Answers (1)

Related Questions