Reputation: 25
I began to have this question while reading this book "R for Data Science" by Hadley Wickham. In the chapter of "Data Transformation", the author used this example:
library(nycflights13)
library(tidyverse)
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
delay <- filter(delay, count > 20, dest != "HNL")
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The output plot looks like as follows: relationship between delay and flight distance
So I wonder if we can know the extremum of the curve and the corresponding x value using R?
From this answer, I've figured using stat_poly_eq() from ggpmisc to calculate the polynomial regression equation:
library(ggpmisc)
formula=y ~ poly(x, 3, raw=TRUE)
p <- ggplot(data = delay, mapping = aes(x = dist, y = delay))
p <- p + geom_point(aes(size = count), alpha = 1/3)
p <- p+ geom_smooth(method = "lm", formula = formula, se = FALSE)
(p1 <- p+ stat_poly_eq(formula = formula, aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE))
And use that equation to find out the extremum of the curve, which I think it's very not clever. So I wonder how can I figure out the extremum and the corresponding x value of my regression curve (not the max and min value of raw data but the max and the min value of the regression curve).
Upvotes: 0
Views: 105
Reputation: 1237
You can see in your first code snippet, your ggplot
call outputs this message:
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Looking at the documentation for geom_smooth
and stat_smooth
:
stats::loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs") with method = "REML".
So you could just call stats::loess
on your data to fit the model implied by the geom_smooth
curve, and then use the predict
method to get values outside of your actual range.
Upvotes: 0