AlSub

Reputation: 1055

Loop to plot boxplot with ggplot

I am using the diamonds data frame from ggplot2.

I would like to draw a boxplot for each numerical column, grouped by category. In this case the category is defined by the "cut" column.

I am using a for-loop to accomplish this task.

Here's the code I am using:


##################################################################################
#                              Data                                              #
#                                                                                #
##################################################################################

data("diamonds")
basePlot <- diamonds[ names(diamonds)[!names(diamonds) %in% c("color", "clarity")] ]

##################################################################################

## set Plot view to 4 boxplots ##
par(mfrow = c(2,2))

## for-loop to boxplot all numerical columns ##

for (i in 1:(ncol(basePlot)-1)){
  print(ggplot(basePlot, aes(as.factor(cut), 
  basePlot[c(i)],color=as.factor(cut)))
        + geom_boxplot(outlier.colour="black",outlier.shape=16,outlier.size=1,notch=FALSE)
        + xlab("Diamond Cut")
        + ylab(colnames(basePlot)[i])
  )
}


Console output:

Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Error in is.finite(x) : default method not implemented for type 'list'

Is there any other way to accomplish this task?

Upvotes: 0

Views: 752

Answers (1)

r2evans

Reputation: 161155

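The loop fails because basePlot[c(i)] returns a one-column data frame rather than a vector, so ggplot2 cannot pick a scale for it. Note also that par(mfrow=...) only affects base graphics and has no effect on ggplot2 plots. Instead of multiple plots, I suggest facets. To do this, though, we need to convert the data from "wide" format to "longer" format, and the canonical way in the tidyverse is with tidyr::pivot_longer.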

> basePlot
# A tibble: 53,940 x 8
   carat cut       depth table price     x     y     z
   <dbl> <ord>     <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium    59.8    61   326  3.89  3.84  2.31
 3 0.23  Good       56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium    62.4    58   334  4.2   4.23  2.63
 5 0.31  Good       63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good  62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good  62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good  61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair       65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good  59.4    61   338  4     4.05  2.39
# ... with 53,930 more rows
> pivot_longer(basePlot, -cut, names_to="var", values_to="val")
# A tibble: 377,580 x 3
   cut     var      val
   <ord>   <chr>  <dbl>
 1 Ideal   carat   0.23
 2 Ideal   depth  61.5 
 3 Ideal   table  55   
 4 Ideal   price 326   
 5 Ideal   x       3.95
 6 Ideal   y       3.98
 7 Ideal   z       2.43
 8 Premium carat   0.21
 9 Premium depth  59.8 
10 Premium table  61   
# ... with 377,570 more rows

With this, we only have to tell ggplot2 to use val for the y values and var for the facets; cut stays on the x-axis.

library(ggplot2)
library(tidyr) # pivot_longer

ggplot(pivot_longer(basePlot, -cut, names_to="var", values_to="val"),
       aes(cut, val, color=cut)) +
  geom_boxplot(outlier.colour="black", outlier.shape=16, outlier.size=1, notch=FALSE) +
  xlab("Diamond Cut") +
  facet_wrap(~var, nrow=2, scales="free") +
  scale_x_discrete(guide=guide_axis(n.dodge=2))

[Figure: ggplot2, faceted boxplots]

cut appears both on the x-axis and in the legend because the color= aesthetic adds a legend. Since the legend is redundant, we could either remove the color aesthetic (which also removes the legend) or keep the colors and suppress the legend by adding + scale_color_discrete(guide="none"). Older ggplot2 releases used guide=FALSE here, which is now deprecated.
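
As a rough sketch of the two options just described, assuming ggplot2 and tidyr are installed. The names long, p_plain and p_nolegend are only illustrative and do not appear in the question.

```r
library(ggplot2)
library(tidyr)  # pivot_longer

data("diamonds", package = "ggplot2")
basePlot <- diamonds[ !names(diamonds) %in% c("color", "clarity") ]

# 'long' is an illustrative name for the reshaped data
long <- pivot_longer(basePlot, -cut, names_to = "var", values_to = "val")

# Option 1: no color aesthetic, so no legend is created at all
p_plain <- ggplot(long, aes(cut, val)) +
  geom_boxplot() +
  xlab("Diamond Cut") +
  facet_wrap(~var, nrow = 2, scales = "free")

# Option 2: keep the colors but suppress the redundant legend
p_nolegend <- ggplot(long, aes(cut, val, color = cut)) +
  geom_boxplot() +
  xlab("Diamond Cut") +
  facet_wrap(~var, nrow = 2, scales = "free") +
  scale_color_discrete(guide = "none")

print(p_nolegend)
```

If the installed ggplot2 version is unknown, adding theme(legend.position = "none") instead should hide the legend in the same way.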

There are two ways of faceting: facet_wrap and facet_grid. The latter is designed for two facet variables (one across the columns, one down the rows) and handles many other configurations. You can use facet_grid with just one variable, which resembles facet_wrap with nrow=1 or ncol=1, but the two differ in styling, such as where the strip labels are placed.
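
To make the comparison concrete, here is a minimal sketch that draws the same data with both functions in a single column, again assuming ggplot2 and tidyr are available. The names long, base, p_wrap and p_grid are illustrative only.

```r
library(ggplot2)
library(tidyr)  # pivot_longer

data("diamonds", package = "ggplot2")
basePlot <- diamonds[ !names(diamonds) %in% c("color", "clarity") ]
long <- pivot_longer(basePlot, -cut, names_to = "var", values_to = "val")

# Shared layers for both variants
base <- ggplot(long, aes(cut, val)) +
  geom_boxplot() +
  xlab("Diamond Cut")

# facet_wrap: one column of panels, strip labels on top of each panel
p_wrap <- base + facet_wrap(~var, ncol = 1, scales = "free_y")

# facet_grid: one row per variable, strip labels on the right-hand side
p_grid <- base + facet_grid(var ~ ., scales = "free_y")

print(p_wrap)
print(p_grid)
```

One design point to keep in mind: in facet_grid the y scale can only vary between rows, so laying the variables out side by side with facet_grid(. ~ var) would force carat and price onto one shared y-axis. With variables on such different scales, facet_wrap or the row layout above is likely the better fit.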

Upvotes: 1
