Data preparation for Sankey Data in R to get flow frequency

Question

I have tried to create a Sankey diagram using both the ggalluvial and networkD3 packages, so far without success. Ideally I would like to understand how to get what I want in both.

Data is generated as follows:

dat <- data.frame(customer = c(rep(c(1, 2), each=3), 3, 3),
              holiday_loc = c("SA", "SA", "AB", "SA", "SA", "SA", "AB", "AB"),
              holiday_num = c(1, 2, 3, 1, 2, 3, 1, 2))

library(dplyr)  # provides %>%
library(tidyr)  # provides spread()

dat_wide <- dat %>%
        spread(key = holiday_num, value = holiday_loc)

I am not sure whether dat or dat_wide is the more appropriate starting point. I want the output to visualise the following information, where the number in brackets is the frequency and therefore the width of the flow:

SA -(2) - SA - (1) - AB
           - (1) - SA
AB -(1) - AB

For networkD3, I followed the instructions in the question "Sankey diagram for Discrete State Sequences in R using networkd3"; however, I ended up with loops in the diagram.

A diagram similar to what I want is shown in the image below: [![Sankey Diagram taken from SAS VA][2]][2]

Suggestions and help will be greatly appreciated...

Thanks!

[2]: https://i.sstatic.net/wTJ1k.png

CJ Yetman · Accepted Answer

The core problem with your data (in networkD3 terms) is that you have nodes with the same name, so you need to distinguish them, at least while you're processing the data.

Combine the location and the number information to make distinguishable nodes, then transform your data into a links data frame, like this...

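library(dplyr)

links <- 
  dat %>% 
  # label each node by location and holiday number, e.g. "SA_1"
  mutate("source" = paste(holiday_loc, holiday_num, sep = "_")) %>% 
  group_by(customer) %>% 
  arrange(holiday_num) %>% 
  # the target is the same customer's next holiday
  mutate("target" =  lead(source)) %>% 
  ungroup() %>% 
  arrange(customer) %>% 
  # a customer's last holiday has no onward flow
  filter(!is.na(target)) %>% 
  select(source, target)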

From that, you can build a nodes data frame which contains one row for each distinct node, like this...

node_names <- factor(sort(unique(c(as.character(links$source), 
                                   as.character(links$target)))))
nodes <- data.frame(name = node_names)

Then convert the links data frame to use the index (0-indexed because it ultimately gets passed to JavaScript) of the node in the nodes data frame, like this...

links <- data.frame(source = match(links$source, node_names) - 1, 
                    target = match(links$target, node_names) - 1,
                    value = 1)
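To illustrate the index conversion in isolation, here is a minimal sketch in base R. The node names are written out by hand for illustration; they are assumed to match what the example data produces.

```r
# Node names written out by hand (assumed to match the example data), sorted
node_names <- sort(c("SA_1", "SA_2", "AB_3", "SA_3", "AB_1", "AB_2"))

# match() returns 1-based positions; subtracting 1 gives the
# 0-based indices that the JavaScript side expects
idx <- match(c("SA_1", "SA_2"), node_names) - 1
print(idx)
```

With the names sorted alphabetically, "SA_1" sits in position 4 and "SA_2" in position 5, so the link SA_1 to SA_2 becomes source 3, target 4.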

At this point, if you want the nodes to have non-distinct names, you can change that now, like this...

nodes$name <- sub("_[0-9]$", "", nodes$name)
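Note that the pattern above strips a single trailing digit only. If a customer could have ten or more holidays, a pattern with `+` may be safer. A small sketch with made-up labels (the "SA_10" label is hypothetical and does not occur in the example data):

```r
# Made-up labels; "SA_10" is hypothetical
labels <- c("SA_1", "AB_3", "SA_10")

# Pattern from the answer: removes "_" followed by exactly one digit
one_digit <- sub("_[0-9]$", "", labels)

# Variant: removes "_" followed by one or more digits
any_digits <- sub("_[0-9]+$", "", labels)

print(one_digit)
print(any_digits)
```

The first pattern leaves "SA_10" unchanged, while the second reduces it to "SA".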

And now you can plot it...

library(networkD3)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "value", NodeID = "name")
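The links built above carry value = 1 per customer transition, so a transition made by two customers appears as two rows. If you would rather have one link per distinct transition, with the frequency as its value (as sketched in the question), one option is to aggregate with dplyr's count(). This is a sketch under that assumption, not part of the approach above; the names links_agg and nodes_agg are made up for illustration.

```r
library(dplyr)

# Example data from the question
dat <- data.frame(customer = c(rep(c(1, 2), each = 3), 3, 3),
                  holiday_loc = c("SA", "SA", "AB", "SA", "SA", "SA", "AB", "AB"),
                  holiday_num = c(1, 2, 3, 1, 2, 3, 1, 2),
                  stringsAsFactors = FALSE)

# One row per customer transition, then collapsed to one row per
# distinct transition with the number of customers as its value
links_agg <- dat %>%
  mutate(source = paste(holiday_loc, holiday_num, sep = "_")) %>%
  group_by(customer) %>%
  arrange(holiday_num, .by_group = TRUE) %>%
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target)) %>%
  count(source, target, name = "value")

# Convert node labels to 0-based indices, as in the answer
node_names <- sort(unique(c(links_agg$source, links_agg$target)))
links_agg$source <- match(links_agg$source, node_names) - 1
links_agg$target <- match(links_agg$target, node_names) - 1
nodes_agg <- data.frame(name = sub("_[0-9]+$", "", node_names))

print(links_agg)
```

For the example data this gives four links: SA_1 to SA_2 with value 2, and SA_2 to AB_3, SA_2 to SA_3, and AB_1 to AB_2 with value 1 each. links_agg and nodes_agg can then be passed to sankeyNetwork() in place of links and nodes.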
