Visualizing the relationship between a continuous predictor and a categorical outcome

Question

I am trying to visualize the relationship between a continuous predictor (range 0-0.8) and a discrete outcome (count variable, possible values: 0, 1, 2).

There are many ways to put the discrete variable on the x-axis and the continuous variable on the y-axis (e.g., dot plot, violin plot, box plot). These show the distribution of the continuous predictor, with a measure of central tendency, for each level of the discrete variable. However, that is not the message I want to convey. I want to show how the likelihood of a higher value of the discrete variable changes as the continuous variable increases.

I have tried doing this with geom_smooth, but since the outcome is discrete, this seems misleading:

p <- ggplot(pheno, aes(adhdanx, polye))
p + geom_smooth(method = "lm", colour = "#007ea7", size = 0.5, fill = "#007ea7")

I am working in R. All suggestions are welcome.

younggeun · Accepted Answer

As far as I know, a linear regression model with only a categorical predictor cannot be drawn as a fitted line. Instead, you can draw the fitted value for each category as a point. I will use the iris data set here.

library(tidyverse)
as_tibble(iris)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ... with 140 more rows

Consider the regression problem Petal.Width ~ Species.

iris %>%
  ggplot() +
  aes(x = Species, y = Petal.Width, colour = Species) +
  geom_boxplot(show.legend = FALSE)

From this box plot, you can see the distribution of Petal.Width within each Species, and that it increases from setosa to versicolor to virginica. A qualitative predictor is coded with dummy variables like this:

contrasts(iris$Species)
#>            versicolor virginica
#> setosa              0         0
#> versicolor          1         0
#> virginica           0         1

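so that the model becomes

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$$

where

$$x_{i1} = \begin{cases} 1 & \text{if observation } i \text{ is versicolor} \\ 0 & \text{otherwise} \end{cases}$$

and

$$x_{i2} = \begin{cases} 1 & \text{if observation } i \text{ is virginica} \\ 0 & \text{otherwise} \end{cases}$$

Thus, each fitted value becomes

$$\hat{Y}_i = \begin{cases} \hat{\beta}_0 & \text{setosa} \\ \hat{\beta}_0 + \hat{\beta}_1 & \text{versicolor} \\ \hat{\beta}_0 + \hat{\beta}_2 & \text{virginica} \end{cases}$$

from these estimates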

lm(Petal.Width ~ Species, data = iris)
#> 
#> Call:
#> lm(formula = Petal.Width ~ Species, data = iris)
#> 
#> Coefficients:
#>       (Intercept)  Speciesversicolor   Speciesvirginica  
#>             0.246              1.080              1.780
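
The coefficients printed above combine into one fitted value per species: the intercept alone for setosa (the baseline row of the contrasts matrix), and the intercept plus the matching dummy coefficient for the other two. The following is a minimal base R sketch of that arithmetic, compared against the per-species sample means. The names `fit`, `b`, `by_hand`, and `group_means` are only illustrative and do not appear in the original answer.

```r
# Fit the same model as above (iris ships with base R)
fit <- lm(Petal.Width ~ Species, data = iris)
b <- coef(fit)

# One fitted value per species, built from the dummy coding:
# setosa is the baseline, the other two add their own coefficient
by_hand <- c(
  setosa     = unname(b["(Intercept)"]),
  versicolor = unname(b["(Intercept)"] + b["Speciesversicolor"]),
  virginica  = unname(b["(Intercept)"] + b["Speciesvirginica"])
)

# Sample mean of Petal.Width within each species
group_means <- vapply(split(iris$Petal.Width, iris$Species), mean, numeric(1))

print(by_hand)
print(all.equal(by_hand, group_means))  # checks that the two agree
```

If the two vectors agree, the red points drawn from the model and those drawn from the group means will coincide.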

With these facts, as mentioned, each fitted value can be drawn on the plot.

From lm():

iris %>%
  select(Species, Petal.Width) %>% # just for clarity
  mutate(pred = lm(Petal.Width ~ Species)$fitted.values) %>% # linear regression
  ggplot() +
  aes(x = Species, y = Petal.Width) +
  geom_point() +
  geom_point(aes(x = Species, y = pred), col = "red", size = 3) # fitted values

Alternatively, noting that each fitted value is the sample mean of the response within its category,

iris %>%
  select(Species, Petal.Width) %>%
  group_by(Species) %>% # for each category
  mutate(pred = mean(Petal.Width)) %>% # sample mean of response in each category
  ggplot() +
  aes(x = Species, y = Petal.Width) +
  geom_point() +
  geom_point(aes(x = Species, y = pred), col = "red", size = 3)
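
Because the fitted values here are just group means, the same picture can probably be produced without adding a `pred` column at all. The sketch below is an alternative to the code above, not part of the original answer. It assumes ggplot2 3.3.0 or later, where `stat_summary()` accepts the `fun` argument, and the object name `p` is only illustrative.

```r
library(ggplot2)

# Raw observations, plus one red point per species at the sample mean.
# stat_summary() computes the mean inside the plot, so the data frame
# does not need a precomputed column of fitted values.
p <- ggplot(iris, aes(x = Species, y = Petal.Width)) +
  geom_point() +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3)

print(p)
```

This keeps the summary step next to the layer that displays it, which may be easier to adapt, for example by swapping `mean` for `median`.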
