dam4l10

Reputation: 349

Visualizing the relationship between a continuous predictor and a categorical outcome

I am trying to visualize the relationship between a continuous predictor (range 0-0.8) and a discrete outcome (count variable, possible values: 0, 1, 2).

There are many ways to show the discrete variable on the x-axis with the continuous variable on the y-axis (e.g., dot plot, violin plot, box plot). These show the distribution of the continuous predictor, with a measure of central tendency, for each level of the discrete variable. However, that is not the message I am trying to convey. I want to show how the likelihood of a higher value of the discrete variable changes as the continuous variable increases.

I have tried doing this with geom_smooth, but since the outcome is discrete, this seems misleading:

p <- ggplot(pheno, aes(adhdanx, polye))
p + geom_smooth(method = "lm", colour = "#007ea7", size = 0.5, fill = "#007ea7")

Plot

I am working in R. All suggestions are welcome.

Upvotes: 0

Views: 2135

Answers (1)

younggeun

Reputation: 953

As far as I know, a linear regression model with only a categorical predictor cannot be drawn as a fitted line. Instead, you can draw the fitted value for each category as a point. Here I use the iris data set.

library(tidyverse)
as_tibble(iris)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ... with 140 more rows

Consider the regression problem Petal.Width ~ Species.

iris %>%
  ggplot() +
  aes(x = Species, y = Petal.Width, colour = Species) +
  geom_boxplot(show.legend = FALSE)

[Plot: box plots of Petal.Width for each Species]

From this box plot, you can see the distribution of Petal.Width in each Species and how it increases from setosa to versicolor to virginica. For a qualitative predictor, the variable is coded like this:

contrasts(iris$Species)
#>            versicolor virginica
#> setosa              0         0
#> versicolor          1         0
#> virginica           0         1

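so that the model becomes

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$$

where

$$x_{i1} = \begin{cases} 1 & \text{if observation } i \text{ is versicolor} \\ 0 & \text{otherwise} \end{cases}$$

and

$$x_{i2} = \begin{cases} 1 & \text{if observation } i \text{ is virginica} \\ 0 & \text{otherwise} \end{cases}$$

Thus, each fitted value becomes

$$\hat{Y}_i = \begin{cases} \hat{\beta}_0 & \text{if observation } i \text{ is setosa} \\ \hat{\beta}_0 + \hat{\beta}_1 & \text{if observation } i \text{ is versicolor} \\ \hat{\beta}_0 + \hat{\beta}_2 & \text{if observation } i \text{ is virginica} \end{cases}$$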

from these estimates

lm(Petal.Width ~ Species, data = iris)
#> 
#> Call:
#> lm(formula = Petal.Width ~ Species, data = iris)
#> 
#> Coefficients:
#>       (Intercept)  Speciesversicolor   Speciesvirginica  
#>             0.246              1.080              1.780
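
The link between the dummy coding and these coefficients can be checked numerically. The following is a minimal sketch using only base R and the built-in iris data: it rebuilds the three fitted values from the coefficients and compares them with the per-species sample means. The names `fit`, `b`, `from_coef`, and `group_means` are introduced here for illustration only.

```r
# Fit the same model as above
fit <- lm(Petal.Width ~ Species, data = iris)
b <- unname(coef(fit))

# Dummy coding actually used by lm(): one distinct row per species
unique(model.matrix(fit))

# Fitted value per species, rebuilt from the coefficients
from_coef <- c(
  setosa     = b[1],
  versicolor = b[1] + b[2],
  virginica  = b[1] + b[3]
)

# Sample mean of Petal.Width within each species
group_means <- tapply(iris$Petal.Width, iris$Species, mean)

# TRUE if the two agree up to numerical tolerance
all.equal(unname(from_coef), as.vector(group_means))
```

With the default treatment coding, the intercept is the mean of the reference level (setosa), and each remaining coefficient is the difference between that level's mean and the reference mean.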

With these facts, as mentioned, each fitted value can be drawn on the plot.

From lm():

iris %>%
  select(Species, Petal.Width) %>% # just for clarity
  mutate(pred = lm(Petal.Width ~ Species)$fitted.values) %>% # linear regression
  ggplot() +
  aes(x = Species, y = Petal.Width) +
  geom_point() +
  geom_point(aes(x = Species, y = pred), col = "red", size = 3) # fitted values

[Plot: Petal.Width by Species, with the fitted values from lm() shown as red points]

Alternatively, since each fitted value is just the sample mean of the response in its category,

iris %>%
  select(Species, Petal.Width) %>%
  group_by(Species) %>% # for each category
  mutate(pred = mean(Petal.Width)) %>% # sample mean of response in each category
  ggplot() +
  aes(x = Species, y = Petal.Width) +
  geom_point() +
  geom_point(aes(x = Species, y = pred), col = "red", size = 3)

[Plot: Petal.Width by Species, with the per-species sample means shown as red points]
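
Because the fitted values are group means, ggplot2 can also compute them while drawing. The sketch below is one possible variant rather than part of the approach above: it assumes ggplot2 3.3.0 or later, where `stat_summary()` accepts the `fun` argument, and the object name `p` is chosen here for illustration.

```r
library(ggplot2)

# stat_summary() computes the mean of y within each x category,
# so no separate `pred` column has to be added to the data
p <- ggplot(iris, aes(x = Species, y = Petal.Width)) +
  geom_point() +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3)

p
```

This keeps the data frame unchanged, at the cost of hiding the connection to lm() that the two versions above make explicit.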

Upvotes: 1
