Reputation: 349
I am trying to visualize the relationship between a continuous predictor (range 0-0.8) and a discrete outcome (count variable, possible values: 0, 1, 2).
There are many ways to show the discrete variable on the x-axis with the continuous variable on the y-axis (e.g., dot plot, violin plot, box plot). These show the distribution of the continuous predictor, along with a measure of central tendency, for each level of the discrete variable. However, that is not the message I am trying to convey. I want to show how the likelihood of a higher value of the discrete variable changes as the continuous variable increases.
I have tried doing this with geom_smooth, but since the outcome is discrete, this seems misleading:
p <- ggplot(pheno, aes(adhdanx, polye))
p + geom_smooth(method = "lm", colour = "#007ea7", size = 0.5, fill = "#007ea7")
I am working in R. All suggestions are welcome.
Upvotes: 0
Views: 2135
Reputation: 953
As far as I know, a linear regression model with only a categorical predictor cannot be drawn as a fitted line. You can draw each fitted value as a point instead. Here I use the iris data set.
library(tidyverse)
as_tibble(iris)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>
#>  1          5.1         3.5          1.4         0.2 setosa
#>  2          4.9         3            1.4         0.2 setosa
#>  3          4.7         3.2          1.3         0.2 setosa
#>  4          4.6         3.1          1.5         0.2 setosa
#>  5          5           3.6          1.4         0.2 setosa
#>  6          5.4         3.9          1.7         0.4 setosa
#>  7          4.6         3.4          1.4         0.3 setosa
#>  8          5           3.4          1.5         0.2 setosa
#>  9          4.4         2.9          1.4         0.2 setosa
#> 10          4.9         3.1          1.5         0.1 setosa
#> # ... with 140 more rows
Consider the regression problem Petal.Width ~ Species.
iris %>%
  ggplot() +
  aes(x = Species, y = Petal.Width, colour = Species) +
  geom_boxplot(show.legend = FALSE)
From this box plot, you can see the distribution of Petal.Width in each Species, and that Petal.Width increases from setosa to versicolor to virginica. For a qualitative predictor, the variable is dummy-coded like this:
contrasts(iris$Species)
#>            versicolor virginica
#> setosa              0         0
#> versicolor          1         0
#> virginica           0         1
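so that the model becomes
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$$
where
$$x_{i1} = \begin{cases} 1 & \text{if observation } i \text{ is versicolor} \\ 0 & \text{otherwise} \end{cases}$$
and
$$x_{i2} = \begin{cases} 1 & \text{if observation } i \text{ is virginica} \\ 0 & \text{otherwise} \end{cases}$$
Thus, each fitted value becomes
$$\hat{Y}_i = \begin{cases} \hat{\beta}_0 & \text{if observation } i \text{ is setosa} \\ \hat{\beta}_0 + \hat{\beta}_1 & \text{if observation } i \text{ is versicolor} \\ \hat{\beta}_0 + \hat{\beta}_2 & \text{if observation } i \text{ is virginica} \end{cases}$$
from these estimates: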
lm(Petal.Width ~ Species, data = iris)
#>
#> Call:
#> lm(formula = Petal.Width ~ Species, data = iris)
#>
#> Coefficients:
#>       (Intercept)  Speciesversicolor   Speciesvirginica
#>             0.246              1.080              1.780
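With the treatment coding shown by contrasts(), the intercept is the fitted value for setosa and the other two coefficients are offsets from it. As a quick check, here is a minimal sketch that rebuilds the three fitted values from the coefficients and compares them with the per-species sample means. The object names fit, b, fitted_by_species and group_means are only illustrative.

```r
# Fit the same model as above
fit <- lm(Petal.Width ~ Species, data = iris)
b <- coef(fit)

# Rebuild the fitted value of each category from the coefficients
fitted_by_species <- c(
  setosa     = unname(b["(Intercept)"]),
  versicolor = unname(b["(Intercept)"] + b["Speciesversicolor"]),
  virginica  = unname(b["(Intercept)"] + b["Speciesvirginica"])
)

# Sample mean of the response in each category
group_means <- sapply(split(iris$Petal.Width, iris$Species), mean)

# The two should agree up to floating-point error
all.equal(unname(fitted_by_species), unname(group_means))
```

So there are only three distinct fitted values, one per species, which is why they are drawn as points rather than as a line.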
Given this, each fitted value can be drawn on the plot. Using lm():
iris %>%
  select(Species, Petal.Width) %>% # just for clarity
  mutate(pred = lm(Petal.Width ~ Species)$fitted.values) %>% # linear regression
  ggplot() +
  aes(x = Species, y = Petal.Width) +
  geom_point() +
  geom_point(aes(x = Species, y = pred), col = "red", size = 3) # fitted values
Alternatively, since each fitted value is just the sample mean of its category:
iris %>%
  select(Species, Petal.Width) %>%
  group_by(Species) %>% # for each category
  mutate(pred = mean(Petal.Width)) %>% # sample mean of response in each category
  ggplot() +
  aes(x = Species, y = Petal.Width) +
  geom_point() +
  geom_point(aes(x = Species, y = pred), col = "red", size = 3)
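If you would rather not precompute the means, ggplot2 can compute them while drawing. This is a minimal sketch, assuming ggplot2 3.3.0 or later (where stat_summary() takes the fun argument). It swaps stat_summary() in for the mutate() step and should give the same red points.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Petal.Width)) +
  geom_point() +
  # stat_summary() applies mean() to Petal.Width within each Species
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3)

p
```

The red points here are still the per-category sample means, so they match the fitted values from lm().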
Upvotes: 1