Reputation: 55
I'm trying to generate a parallel coordinates plot, where each variable has its own axis. For instance:
So far, I've used the function ggparcoord()
from the package GGally
. However, as far as I can tell, it does not allow each variable to have its own axis as above.
Does anyone know how this may be done using R
, preferably using ggplot2
? Thanks in advance.
Upvotes: 0
Views: 1046
Reputation: 23
If someone finds the question on google like I did, the easy to use function from @MattSainsbury-Dale answer produces some warnings because of deprecated functions:
summarise()
group was deprecated in dplyr 1.1.0.aes_string()
was deprecated in ggplot2 3.0.0.I fixed those as I might be using this on a Shiny app and thought I'd share it here.
#Original function by @MattSainsbury-Dale from https://stackoverflow.com/questions/69694711/individual-y-axis-for-each-level-in-parallel-coordinates-plot
library(reshape2)
ggparcoord_ind_yaxis <- function(
data,
truth = NULL,
truthPointSize = 2,
columns = 1:ncol(data),
groupColumn = NULL,
alphaLines = 1,
nbreaks = 4,
axis_font_size = 3
) {
# select the variables to plot
data_subset <- data %>% select(all_of(columns))
# re-order truth to match columns
col_names <- data_subset %>% names
if (!is.null(truth)) {
truth <- truth %>% select(all_of(col_names))
data_subset <- data_subset %>% rbind(truth)
}
# Calculate the axis breaks for each variable on the *original* scale.
# Note that the breaks computed by pretty() are guaranteed to contain all of
# the data. We include truth in these breaks, just in case one of the true
# points falls outside the range of the data (can easily happen in the context
# of comparing parameter estimates to the true values).
breaks_df <- data_subset %>%
stack %>% # convert to long format
group_by(ind) %>% # group by the plotting variables
reframe(breaks = pretty(values, n = nbreaks))
# Normalise the breaks to be between 0 and 1, and set the coordinates of the
# tick marks. Importantly, if we want the axis heights to be the same, the
# breaks need to be normalised to be between exactly 0 and 1.
axis_df <- breaks_df %>%
group_by(ind) %>%
mutate(yval = (breaks - min(breaks))/(max(breaks) - min(breaks))) %>%
mutate(xmin = as.numeric(ind) - 0.05,
xmax = as.numeric(ind),
x_text = as.numeric(ind) - 0.2)
# Calculate the co-ordinates for our axis lines:
axis_line_df <- axis_df %>%
group_by(ind) %>%
summarize(min = min(yval), max = max(yval))
# Getting the minimum/maximum breaks on the original scale, to scale the
# data in the same manner that we scaled the breaks
minmax_breaks <- breaks_df %>%
group_by(ind) %>%
summarize(min_break = min(breaks), max_break = max(breaks)) %>%
tibble::column_to_rownames(var = "ind")
# Normalise the original data in the same way that the breaks were normalised.
# This ensures that the scaling is correct.
# Do the same for the truth points, if they exist.
lines_df <- data %>% select(all_of(columns))
for (col in col_names) {
lines_df[, col] <- (lines_df[, col] - minmax_breaks[col, "min_break"]) / ( minmax_breaks[col, "max_break"] - minmax_breaks[col, "min_break"])
if (!is.null(truth)) {
truth[, col] <- (truth[, col] - minmax_breaks[col, "min_break"]) / ( minmax_breaks[col, "max_break"] - minmax_breaks[col, "min_break"])
}
}
# Reshape original data (and truth):
lines_df <- lines_df %>%
mutate(row = row_number()) %>% # need row information to group individual rows
bind_cols(data[, groupColumn, drop = FALSE]) %>% # need groupColumn for colour aesthetic
reshape2::melt(id.vars = c("row", groupColumn),
# choose names that are consistent with stack() above:
value.name = "values", variable.name = "ind")
# Reshape truth, as above
if (!is.null(truth)) {
truth <- truth %>%
mutate(row = row_number()) %>% # need row information to group individual rows
reshape2::melt(id.vars = c("row"),
# choose names that are consistent with stack():
value.name = "values", variable.name = "ind")
}
# Now plot:
gg <- ggplot() +
geom_line(data = lines_df %>% sample_n(nrow(.)), # permute rows to prevent one group dominating over another
aes(x = ind, y = values, group = row, colour = get(groupColumn), linetype = get(groupColumn)),
alpha = alphaLines) +
geom_segment(data = axis_line_df, aes(x = ind, xend = ind, y = min, yend = max),
inherit.aes = FALSE) +
geom_segment(data = axis_df, aes(x = xmin, xend = xmax, y = yval, yend = yval),
inherit.aes = FALSE) +
geom_text(data = axis_df, aes(x = x_text, y = yval, label = breaks),
inherit.aes = FALSE, size = axis_font_size)
if (!is.null(truth)) {
gg <- gg + geom_point(data = truth, aes(x = ind, y = values),
inherit.aes = FALSE, colour = "black", size = truthPointSize)
}
gg <- gg + theme_bw() +
theme(panel.grid = element_blank(),
panel.border = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.text.y = element_blank())
return(gg)
}
Upvotes: 1
Reputation: 55
Thanks again to @Allan Cameron for his excellent answer. I used his code to write a function that mimics GGally::ggparcoord()
, but with individual y-axes. Here, the normalisation of the axis breaks and the data is done so that the heights of the axes are identical.
I also added an argument truth
, which is an optional data.frame
containing points to plot for each variable; in the context of my application, the lines correspond to parameter estimates, and the truth
points are the true values we are trying to estimate.
Here is the function:
ggparcoord_ind_yaxis <- function(
data,
truth = NULL,
truthPointSize = 2,
columns = 1:ncol(data),
groupColumn = NULL,
alphaLines = 1,
nbreaks = 4,
axis_font_size = 3
) {
# select the variables to plot
data_subset <- data %>% select(columns)
# re-order truth to match columns
col_names <- data_subset %>% names
if (!is.null(truth)) {
truth <- truth %>% select(col_names)
data_subset <- data_subset %>% rbind(truth)
}
# Calculate the axis breaks for each variable on the *original* scale.
# Note that the breaks computed by pretty() are guaranteed to contain all of
# the data. We include truth in these breaks, just in case one of the true
# points falls outside the range of the data (can easily happen in the context
# of comparing parameter estimates to the true values).
breaks_df <- data_subset %>%
stack %>% # convert to long format
group_by(ind) %>% # group by the plotting variables
summarize(breaks = pretty(values, n = nbreaks))
# Normalise the breaks to be between 0 and 1, and set the coordinates of the
# tick marks. Importantly, if we want the axis heights to be the same, the
# breaks need to be normalised to be between exactly 0 and 1.
axis_df <- breaks_df %>%
mutate(yval = (breaks - min(breaks))/(max(breaks) - min(breaks))) %>%
mutate(xmin = as.numeric(ind) - 0.05,
xmax = as.numeric(ind),
x_text = as.numeric(ind) - 0.2)
# Calculate the co-ordinates for our axis lines:
axis_line_df <- axis_df %>%
group_by(ind) %>%
summarize(min = min(yval), max = max(yval))
# Getting the minimum/maximum breaks on the original scale, to scale the
# data in the same manner that we scaled the breaks
minmax_breaks <- breaks_df %>%
summarize(min_break = min(breaks), max_break = max(breaks)) %>%
tibble::column_to_rownames(var = "ind")
# Normalise the original data in the same way that the breaks were normalised.
# This ensures that the scaling is correct.
# Do the same for the truth points, if they exist.
lines_df <- data %>% select(columns)
for (col in col_names) {
lines_df[, col] <- (lines_df[, col] - minmax_breaks[col, "min_break"]) / ( minmax_breaks[col, "max_break"] - minmax_breaks[col, "min_break"])
if (!is.null(truth)) {
truth[, col] <- (truth[, col] - minmax_breaks[col, "min_break"]) / ( minmax_breaks[col, "max_break"] - minmax_breaks[col, "min_break"])
}
}
# Reshape original data (and truth):
lines_df <- lines_df %>%
mutate(row = row_number()) %>% # need row information to group individual rows
bind_cols(data[, groupColumn, drop = FALSE]) %>% # need groupColumn for colour aesthetic
reshape2::melt(id.vars = c("row", groupColumn),
# choose names that are consistent with stack() above:
value.name = "values", variable.name = "ind")
# Reshape truth, as above
if (!is.null(truth)) {
truth <- truth %>%
mutate(row = row_number()) %>% # need row information to group individual rows
reshape2::melt(id.vars = c("row"),
# choose names that are consistent with stack():
value.name = "values", variable.name = "ind")
}
# Now plot:
gg <- ggplot() +
geom_line(data = lines_df %>% sample_n(nrow(.)), # permute rows to prevent one group dominating over another
aes_string(x = "ind", y = "values", group = "row", colour = groupColumn),
alpha = alphaLines) +
geom_segment(data = axis_line_df, aes(x = ind, xend = ind, y = min, yend = max),
inherit.aes = FALSE) +
geom_segment(data = axis_df, aes(x = xmin, xend = xmax, y = yval, yend = yval),
inherit.aes = FALSE) +
geom_text(data = axis_df, aes(x = x_text, y = yval, label = breaks),
inherit.aes = FALSE, size = axis_font_size)
if (!is.null(truth)) {
gg <- gg + geom_point(data = truth, aes(x = ind, y = values),
inherit.aes = FALSE, colour = "red", size = truthPointSize)
}
gg <- gg + theme_bw() +
theme(panel.grid = element_blank(),
panel.border = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.text.y = element_blank())
return(gg)
}
An example using the iris
data set:
library("ggplot2")
library("dplyr")
library("tibble")
truth <- iris %>% select(4:1) %>% apply(2, median, simplify = FALSE) %>% data.frame
ggparcoord_ind_yaxis(iris, truth = truth, columns = 4:1, groupColumn = "Species", alphaLines = 0.5)
Upvotes: 1
Reputation: 173858
I'm not aware of any packages that can do this, but it's not too difficult to draw the axes yourself in ggplot.
Let's say we have a similar dataset to the one shown in your example plot:
library(ggplot2)
library(dplyr)
cars <- mtcars %>%
select(c(2:4, 6:7, 1)) %>%
tibble::rownames_to_column("model") %>%
as_tibble()
cars
#> # A tibble: 32 x 7
#> model cyl disp hp wt qsec mpg
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 6 160 110 2.62 16.5 21
#> 2 Mazda RX4 Wag 6 160 110 2.88 17.0 21
#> 3 Datsun 710 4 108 93 2.32 18.6 22.8
#> 4 Hornet 4 Drive 6 258 110 3.22 19.4 21.4
#> 5 Hornet Sportabout 8 360 175 3.44 17.0 18.7
#> 6 Valiant 6 225 105 3.46 20.2 18.1
#> 7 Duster 360 8 360 245 3.57 15.8 14.3
#> 8 Merc 240D 4 147. 62 3.19 20 24.4
#> 9 Merc 230 4 141. 95 3.15 22.9 22.8
#> 10 Merc 280 6 168. 123 3.44 18.3 19.2
#> # ... with 22 more rows
We can calculate the axis breaks (and set the coordinates of the tick marks) using some simple arithmetic:
axis_df <- stack(cars[-1]) %>%
group_by(ind) %>%
summarize(breaks = pretty(values, n = 10),
yval = (breaks - min(breaks))/(max(values) - min(values))) %>%
mutate(xmin = as.numeric(ind) - 0.05,
xmax = as.numeric(ind),
x_text = as.numeric(ind) - 0.2)
And the co-ordinates for our actual axis lines like this:
axis_line_df <- axis_df %>%
group_by(ind) %>%
summarize(min = min(yval), max = max(yval))
Now we need to reshape and normalize the original data:
lines_df <- cars[-1] %>%
mutate(across(everything(), function(x) (x - min(x))/(max(x) - min(x)))) %>%
stack() %>%
mutate(row = rep(cars$model, ncol(cars) - 1))
Finally, the plotting code would be something like:
ggplot(lines_df, aes(ind, values, group = row)) +
geom_line(color = "orange", alpha = 0.5) +
geom_segment(data = axis_line_df, aes(x = ind, xend = ind, y = min, yend = max),
inherit.aes = FALSE) +
geom_segment(data = axis_df, aes(x = xmin, xend = xmax, y = yval, yend = yval),
inherit.aes = FALSE) +
geom_text(data = axis_df, aes(x = x_text, y = yval, label = breaks),
inherit.aes = FALSE) +
geom_text(data = axis_line_df, aes(x = ind, y = 1.2, label = ind),
size = 6, inherit.aes = FALSE, check_overlap = TRUE, hjust = 1) +
theme_void() +
theme(plot.margin = margin(50, 20, 50, 20))
Created on 2021-10-24 by the reprex package (v2.0.0)
Upvotes: 1