Reputation: 55

Individual y-axis for each level in parallel coordinates plot

I'm trying to generate a parallel coordinates plot, where each variable has its own axis. For instance:

So far, I've used the function ggparcoord() from the package GGally. However, as far as I can tell, it does not allow each variable to have its own axis as above.

Does anyone know how this may be done using R, preferably using ggplot2? Thanks in advance.

Upvotes: 0

Answers (3)

otter_interested_in_nature

Reputation: 23

If someone finds the question on google like I did, the easy to use function from @MattSainsbury-Dale answer produces some warnings because of deprecated functions:

Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0
Warning: Returning more (or less) than 1 row per summarise() group was deprecated in dplyr 1.1.0.
Warning: aes_string() was deprecated in ggplot2 3.0.0.

I fixed those as I might be using this on a Shiny app and thought I'd share it here.

#Original function by @MattSainsbury-Dale from https://stackoverflow.com/questions/69694711/individual-y-axis-for-each-level-in-parallel-coordinates-plot 

library(reshape2)
ggparcoord_ind_yaxis <- function(
    data,
    truth = NULL, 
    truthPointSize = 2, 
    columns = 1:ncol(data),
    groupColumn = NULL, 
    alphaLines = 1, 
    nbreaks = 4, 
    axis_font_size = 3
) {
  
  # select the variables to plot
  data_subset <- data %>% select(all_of(columns))
  
  # re-order truth to match columns
  col_names <- data_subset %>% names
  if (!is.null(truth)) {
    truth <- truth %>% select(all_of(col_names))
    data_subset <- data_subset %>% rbind(truth)
  } 
  
  # Calculate the axis breaks for each variable on the *original* scale.
  # Note that the breaks computed by pretty() are guaranteed to contain all of 
  # the data. We include truth in these breaks, just in case one of the true 
  # points falls outside the range of the data (can easily happen in the context
  # of comparing parameter estimates to the true values).
  breaks_df <- data_subset %>% 
    stack %>%           # convert to long format
    group_by(ind) %>%   # group by the plotting variables
    reframe(breaks = pretty(values, n = nbreaks))
  
  # Normalise the breaks to be between 0 and 1, and set the coordinates of the 
  # tick marks. Importantly, if we want the axis heights to be the same, the 
  # breaks need to be normalised to be between exactly 0 and 1. 
  axis_df <- breaks_df %>% 
    group_by(ind) %>%
    mutate(yval = (breaks - min(breaks))/(max(breaks) - min(breaks))) %>%
    mutate(xmin = as.numeric(ind) - 0.05, 
           xmax = as.numeric(ind),
           x_text = as.numeric(ind) - 0.2)
  
  # Calculate the co-ordinates for our axis lines:
  axis_line_df <- axis_df %>% 
    group_by(ind) %>%
    summarize(min = min(yval), max = max(yval))
  
  # Getting the minimum/maximum breaks on the original scale, to scale the 
  # data in the same manner that we scaled the breaks
  minmax_breaks <- breaks_df %>%
    group_by(ind) %>%
    summarize(min_break = min(breaks), max_break = max(breaks)) %>% 
    tibble::column_to_rownames(var = "ind")
  
  # Normalise the original data in the same way that the breaks were normalised.
  # This ensures that the scaling is correct. 
  # Do the same for the truth points, if they exist.
  lines_df <- data %>% select(all_of(columns)) 
  for (col in col_names) {
    lines_df[, col] <- (lines_df[, col] - minmax_breaks[col, "min_break"]) / ( minmax_breaks[col, "max_break"] -  minmax_breaks[col, "min_break"])
    if (!is.null(truth)) {
      truth[, col] <- (truth[, col] - minmax_breaks[col, "min_break"]) / ( minmax_breaks[col, "max_break"] -  minmax_breaks[col, "min_break"])
    }
  }
  
  # Reshape original data (and truth):
  lines_df <- lines_df %>%
    mutate(row = row_number()) %>% # need row information to group individual rows
    bind_cols(data[, groupColumn, drop = FALSE]) %>% # need groupColumn for colour aesthetic
    reshape2::melt(id.vars = c("row", groupColumn), 
                   # choose names that are consistent with stack() above:
                   value.name = "values", variable.name = "ind") 
  
  # Reshape truth, as above
  if (!is.null(truth)) {
    truth <- truth %>%
      mutate(row = row_number()) %>% # need row information to group individual rows
      reshape2::melt(id.vars = c("row"), 
                     # choose names that are consistent with stack():
                     value.name = "values", variable.name = "ind") 
  }
  
  # Now plot: 
  gg <- ggplot() + 
    geom_line(data = lines_df %>% sample_n(nrow(.)), # permute rows to prevent one group dominating over another
              aes(x = ind, y = values, group = row, colour = get(groupColumn), linetype = get(groupColumn)), 
              alpha = alphaLines) +
    geom_segment(data = axis_line_df, aes(x = ind, xend = ind, y = min, yend = max),
                 inherit.aes = FALSE) +
    geom_segment(data = axis_df, aes(x = xmin, xend = xmax, y = yval, yend = yval),
                 inherit.aes = FALSE) +
    geom_text(data = axis_df, aes(x = x_text, y = yval, label = breaks),
              inherit.aes = FALSE, size = axis_font_size) 
  
  if (!is.null(truth)) {
    gg <- gg + geom_point(data = truth, aes(x = ind, y = values), 
                          inherit.aes = FALSE, colour = "black", size = truthPointSize)
  }
  
  gg <- gg + theme_bw() + 
    theme(panel.grid = element_blank(), 
          panel.border = element_blank(), 
          axis.title = element_blank(),
          axis.ticks =  element_blank(), 
          axis.text.y = element_blank()) 
  
  return(gg)
}

Upvotes: 1

MattSainsbury-Dale

Reputation: 55

Thanks again to @Allan Cameron for his excellent answer. I used his code to write a function that mimics GGally::ggparcoord(), but with individual y-axes. Here, the normalisation of the axis breaks and the data is done so that the heights of the axes are identical.

I also added an argument truth, which is an optional data.frame containing points to plot for each variable; in the context of my application, the lines correspond to parameter estimates, and the truth points are the true values we are trying to estimate.

Here is the function:

ggparcoord_ind_yaxis <- function(
  data,
  truth = NULL, 
  truthPointSize = 2, 
  columns = 1:ncol(data),
  groupColumn = NULL, 
  alphaLines = 1, 
  nbreaks = 4, 
  axis_font_size = 3
) {
  
  # select the variables to plot
  data_subset <- data %>% select(columns)
  
  # re-order truth to match columns
  col_names <- data_subset %>% names
  if (!is.null(truth)) {
    truth <- truth %>% select(col_names)
    data_subset <- data_subset %>% rbind(truth)
  } 
  
  # Calculate the axis breaks for each variable on the *original* scale.
  # Note that the breaks computed by pretty() are guaranteed to contain all of 
  # the data. We include truth in these breaks, just in case one of the true 
  # points falls outside the range of the data (can easily happen in the context
  # of comparing parameter estimates to the true values).
  breaks_df <- data_subset %>% 
    stack %>%           # convert to long format
    group_by(ind) %>%   # group by the plotting variables
    summarize(breaks = pretty(values, n = nbreaks))
  
  # Normalise the breaks to be between 0 and 1, and set the coordinates of the 
  # tick marks. Importantly, if we want the axis heights to be the same, the 
  # breaks need to be normalised to be between exactly 0 and 1. 
  axis_df <- breaks_df %>% 
    mutate(yval = (breaks - min(breaks))/(max(breaks) - min(breaks))) %>%
    mutate(xmin = as.numeric(ind) - 0.05, 
           xmax = as.numeric(ind),
           x_text = as.numeric(ind) - 0.2)
  
  # Calculate the co-ordinates for our axis lines:
  axis_line_df <- axis_df %>% 
    group_by(ind) %>%
    summarize(min = min(yval), max = max(yval))
  
  # Getting the minimum/maximum breaks on the original scale, to scale the 
  # data in the same manner that we scaled the breaks
  minmax_breaks <- breaks_df %>%
    summarize(min_break = min(breaks), max_break = max(breaks)) %>% 
    tibble::column_to_rownames(var = "ind")
  
  # Normalise the original data in the same way that the breaks were normalised.
  # This ensures that the scaling is correct. 
  # Do the same for the truth points, if they exist.
  lines_df <- data %>% select(columns) 
  for (col in col_names) {
    lines_df[, col] <- (lines_df[, col] - minmax_breaks[col, "min_break"]) / ( minmax_breaks[col, "max_break"] -  minmax_breaks[col, "min_break"])
    if (!is.null(truth)) {
      truth[, col] <- (truth[, col] - minmax_breaks[col, "min_break"]) / ( minmax_breaks[col, "max_break"] -  minmax_breaks[col, "min_break"])
    }
  }
  
  # Reshape original data (and truth):
  lines_df <- lines_df %>%
    mutate(row = row_number()) %>% # need row information to group individual rows
    bind_cols(data[, groupColumn, drop = FALSE]) %>% # need groupColumn for colour aesthetic
    reshape2::melt(id.vars = c("row", groupColumn), 
                   # choose names that are consistent with stack() above:
                   value.name = "values", variable.name = "ind") 
  
  # Reshape truth, as above
  if (!is.null(truth)) {
    truth <- truth %>%
      mutate(row = row_number()) %>% # need row information to group individual rows
      reshape2::melt(id.vars = c("row"), 
                     # choose names that are consistent with stack():
                     value.name = "values", variable.name = "ind") 
  }
  
  # Now plot: 
  gg <- ggplot() + 
    geom_line(data = lines_df %>% sample_n(nrow(.)), # permute rows to prevent one group dominating over another
              aes_string(x = "ind", y = "values", group = "row", colour = groupColumn), 
              alpha = alphaLines) +
    geom_segment(data = axis_line_df, aes(x = ind, xend = ind, y = min, yend = max),
                 inherit.aes = FALSE) +
    geom_segment(data = axis_df, aes(x = xmin, xend = xmax, y = yval, yend = yval),
                 inherit.aes = FALSE) +
    geom_text(data = axis_df, aes(x = x_text, y = yval, label = breaks),
              inherit.aes = FALSE, size = axis_font_size) 
  
  if (!is.null(truth)) {
    gg <- gg + geom_point(data = truth, aes(x = ind, y = values), 
                          inherit.aes = FALSE, colour = "red", size = truthPointSize)
  }
  
  gg <- gg + theme_bw() + 
    theme(panel.grid = element_blank(), 
          panel.border = element_blank(), 
          axis.title = element_blank(),
          axis.ticks =  element_blank(), 
          axis.text.y = element_blank()) 
  
  return(gg)
}

An example using the iris data set:

library("ggplot2")
library("dplyr")
library("tibble")

truth <- iris %>% select(4:1) %>% apply(2, median, simplify = FALSE) %>% data.frame

ggparcoord_ind_yaxis(iris, truth = truth, columns = 4:1, groupColumn = "Species", alphaLines = 0.5)

Upvotes: 1

Allan Cameron

Reputation: 173858

I'm not aware of any packages that can do this, but it's not too difficult to draw the axes yourself in ggplot.

Let's say we have a similar dataset to the one shown in your example plot:

library(ggplot2)
library(dplyr)

cars <- mtcars %>% 
          select(c(2:4, 6:7, 1)) %>%
          tibble::rownames_to_column("model") %>%
  as_tibble()

cars
#> # A tibble: 32 x 7
#>    model               cyl  disp    hp    wt  qsec   mpg
#>    <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4             6  160    110  2.62  16.5  21  
#>  2 Mazda RX4 Wag         6  160    110  2.88  17.0  21  
#>  3 Datsun 710            4  108     93  2.32  18.6  22.8
#>  4 Hornet 4 Drive        6  258    110  3.22  19.4  21.4
#>  5 Hornet Sportabout     8  360    175  3.44  17.0  18.7
#>  6 Valiant               6  225    105  3.46  20.2  18.1
#>  7 Duster 360            8  360    245  3.57  15.8  14.3
#>  8 Merc 240D             4  147.    62  3.19  20    24.4
#>  9 Merc 230              4  141.    95  3.15  22.9  22.8
#> 10 Merc 280              6  168.   123  3.44  18.3  19.2
#> # ... with 22 more rows

We can calculate the axis breaks (and set the coordinates of the tick marks) using some simple arithmetic:

axis_df <- stack(cars[-1]) %>% 
             group_by(ind) %>% 
             summarize(breaks = pretty(values, n = 10),
                       yval = (breaks - min(breaks))/(max(values) - min(values))) %>%
             mutate(xmin = as.numeric(ind) - 0.05, 
                    xmax = as.numeric(ind),
                    x_text = as.numeric(ind) - 0.2)

And the co-ordinates for our actual axis lines like this:

axis_line_df <- axis_df %>% 
                  group_by(ind) %>%
                  summarize(min = min(yval), max = max(yval))

Now we need to reshape and normalize the original data:

lines_df <- cars[-1] %>%
   mutate(across(everything(), function(x) (x - min(x))/(max(x) - min(x)))) %>%
   stack() %>%
   mutate(row = rep(cars$model, ncol(cars) - 1))

Finally, the plotting code would be something like:

ggplot(lines_df, aes(ind, values, group = row)) + 
  geom_line(color = "orange", alpha = 0.5) +
  geom_segment(data = axis_line_df, aes(x = ind, xend = ind, y = min, yend = max),
               inherit.aes = FALSE) +
  geom_segment(data = axis_df, aes(x = xmin, xend = xmax, y = yval, yend = yval),
               inherit.aes = FALSE) +
  geom_text(data = axis_df, aes(x = x_text, y = yval, label = breaks),
            inherit.aes = FALSE) +
  geom_text(data = axis_line_df, aes(x = ind, y = 1.2, label = ind),
            size = 6, inherit.aes = FALSE, check_overlap = TRUE, hjust = 1) +
  theme_void() +
  theme(plot.margin = margin(50, 20, 50, 20))

^{Created on 2021-10-24 by the reprex package (v2.0.0)}

Upvotes: 1

Individual y-axis for each level in parallel coordinates plot

Answers (3)

Related Questions