Jorge Gomes

Reputation: 314

Multistep Sankey Graph from Dataframe

I have a dataframe with the following structure

INDEX ANO DISTRITO CONCELHO NCCO
0 2013.0 Aveiro Albergaria-a-Velha 98
1 2013.0 Aveiro Albergaria-a-velha 1
2 2013.0 Aveiro Anadia 41

The full dataset can be found here

This dataset covers 2013 to 2022 (ANO) and includes 18 districts (DISTRITO), 278 counties (CONCELHO), and the number of forest fires per CONCELHO (`NCCO`).

I can produce a one-step Sankey graph with this code, which I adapted from here:

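import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

df = pd.read_csv('heatmap_full.csv') #generated by ingestor.py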

all_nodes = df.ANO.values.tolist() + df.DISTRITO.values.tolist() 
source_indices = [all_nodes.index(ANO) for ANO in df.ANO]
target_indices = [all_nodes.index(DISTRITO) for DISTRITO in df.DISTRITO]

colors = px.colors.qualitative.D3
node_colors = [np.random.choice(colors) for node in all_nodes]

fig = go.Figure(data=[go.Sankey(
    # Define nodes
    node = dict(
    pad = 20,
    thickness = 20,
    line = dict(color = "black", width = 1.0),
    label =  all_nodes,
    color =  node_colors,
    ),

    # Add links
    link = dict(
      source =  source_indices,
      target =  target_indices,
      value =  df.NCCO,
))])

fig.update_layout(title_text="FOREST FIRES IN PORTUGAL",
                height = 900,
                width=1200,
                font_size=18)
fig.show()

One Step Sankey

My Problem/Question

I would like to have a step after DISTRITO for CONCELHO appearing in the Sankey graph, but I can't figure it out.

Can I add a new trace to the figure? Do I need to treat my original dataset in another way?

Any help would be much appreciated

Disclosure: this is not meant for commercial use.

Upvotes: 0

Views: 2318

Answers (1)

Rob Raymond

Reputation: 31146

  • This reuses the approach from this answer ("plotly sankey graph data formatting") to build the Sankey diagram.
  • Build a data frame of source and target values, with two data cleanups:
    1. Some CONCELHO values are duplicated because of inconsistent capitalisation.
    2. Some names appear in both CONCELHO and DISTRITO; these are modified so the Sankey has no circular links.
  • As noted in the comments, there really are too many nodes to represent in a single Sankey, so the code below works on a random sample of 100 rows.

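To make the two cleanups concrete, here is a minimal sketch on a toy frame. The rows and the fire counts are made up for illustration and are not taken from the real dataset; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration (not real data)
toy = pd.DataFrame(
    {
        "DISTRITO": ["Aveiro", "Aveiro", "Aveiro"],
        "CONCELHO": ["Albergaria-a-Velha", "Albergaria-a-velha", "Aveiro"],
        "NCCO": [98, 1, 7],
    }
)

# Cleanup 2: a county that shares its district's name gets a "_c" suffix,
# so the DISTRITO -> CONCELHO link never points from a node back to itself
same_name = toy["DISTRITO"] == toy["CONCELHO"]
toy["CONCELHO"] = np.where(same_name, toy["CONCELHO"] + "_c", toy["CONCELHO"])

# Cleanup 1: normalise capitalisation so case variants collapse into one label
toy["CONCELHO"] = toy["CONCELHO"].str.capitalize()

# After cleaning, the two spellings of Albergaria-a-Velha are summed together
cleaned = toy.groupby(["DISTRITO", "CONCELHO"], as_index=False)["NCCO"].sum()
print(cleaned)
```

The suffix is applied before capitalising, in the same order as the full code that follows, so the comparison against DISTRITO sees the original spelling.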
import pandas as pd
import numpy as np
import plotly.graph_objects as go

df_in = pd.read_csv("https://raw.githubusercontent.com/vostpt/ICNF_DATA/main/heatmap_full.csv")

# too much data
df_in = df_in.sample(100)

# where a county shares its district's name, suffix the county so the two stay distinct nodes
df_in["CONCELHO"] = np.where(df_in["DISTRITO"]==df_in["CONCELHO"], df_in["CONCELHO"]+"_c", df_in["CONCELHO"])
# normalise capitalisation so case variants of the same county collapse into one node
df_in["CONCELHO"] = df_in["CONCELHO"].str.capitalize()
df = df_in.groupby(["ANO","DISTRITO"], as_index=False)["NCCO"].sum().rename(columns={"ANO":"source","DISTRITO":"target","NCCO":"value"})
df["source"] = df["source"].astype(int).astype(str)

df = pd.concat([df, df_in.groupby(["DISTRITO","CONCELHO"], as_index=False)["NCCO"].sum().rename(columns={"DISTRITO":"source","CONCELHO":"target", "NCCO":"value"})])

nodes = np.unique(df[["source","target"]], axis=None)
nodes = pd.Series(index=nodes, data=range(len(nodes)))

fig = go.Figure(
    go.Sankey(
        node={"label": nodes.index},
        link={
            "source": nodes.loc[df["source"]],
            "target": nodes.loc[df["target"]],
            "value": df["value"],
        },
    )
)
fig.show()  # needed when running as a script rather than in a notebook
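
The same idea extends to any number of steps. The sketch below is one possible generalisation, not part of the original answer: `sankey_links` is a hypothetical helper (not a plotly function) that builds one set of links per consecutive pair of columns. It identifies each node by its depth plus its name, which is an alternative to the "_c" suffix for keeping a district and a same-named county apart. The toy rows are invented for illustration.

```python
import pandas as pd


def sankey_links(df, levels, value_col):
    """Hypothetical helper: build Sankey node labels and link lists
    for a chain of columns such as ANO -> DISTRITO -> CONCELHO."""
    links = []
    for depth, (src, tgt) in enumerate(zip(levels, levels[1:])):
        # one aggregated link per (source, target) pair at this step
        pair = df.groupby([src, tgt], as_index=False)[value_col].sum()
        for s, t, v in zip(pair[src], pair[tgt], pair[value_col]):
            links.append(((depth, str(s)), (depth + 1, str(t)), v))

    # a node is a (depth, name) pair, so equal names in different
    # columns remain separate nodes and no link is circular
    nodes = sorted({node for s, t, _ in links for node in (s, t)})
    index = {node: i for i, node in enumerate(nodes)}
    return {
        "label": [name for _, name in nodes],
        "source": [index[s] for s, _, _ in links],
        "target": [index[t] for _, t, _ in links],
        "value": [v for _, _, v in links],
    }


# Toy rows, invented for illustration; ANO is kept as int here, whereas
# the real file stores it as float and would need .astype(int) first
toy = pd.DataFrame(
    {
        "ANO": [2013, 2013, 2014],
        "DISTRITO": ["Aveiro", "Aveiro", "Braga"],
        "CONCELHO": ["Anadia", "Aveiro", "Braga"],
        "NCCO": [41, 7, 5],
    }
)
sankey = sankey_links(toy, ["ANO", "DISTRITO", "CONCELHO"], "NCCO")
print(sankey["label"])
```

The result can then be passed to plotly as `go.Sankey(node={"label": sankey["label"]}, link={k: sankey[k] for k in ("source", "target", "value")})`. This does nothing about the number of nodes, so the full dataset would still need sampling or filtering first.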


Upvotes: 1
