Reputation: 1105
So I have data like this:
fruit cost count
banana 23 1
orange 13 13
grape 10 32
banana 64 42
orange 23 24
grape 10 2
banana 112 12
orange 23 42
grape 64 1
And basically I would like to build a simple linear regression model for each fruit.
So a regression model for banana would only use:
banana_df
fruit cost count
banana 23 1
banana 64 42
banana 112 12
This is the equation:
banana_eq <- lm(cost ~ count, data = banana_df)
Then I get a column of predictions:
banana_df$estimated <- predict(banana_eq, newdata = banana_df)
So I want to do this for every fruit. The final df should have the same number of rows as the original, plus a new variable holding the regression estimates.
Again, the critical part is that each fruit gets its own regression. 5 fruits mean 5 regression models. Tidyverse solution preferred.
FINAL DF:
fruit cost count estimated
banana 23 1 XX
orange 13 13 YY
grape 10 32 ZZ
banana 64 42 XX
orange 23 24 YY
grape 10 2 ZZ
banana 112 12 XX
orange 23 42 YY
grape 64 1 ZZ
XX, YY, and ZZ represent the estimates produced by that fruit's regression model.
Upvotes: 0
Views: 205
Reputation: 39613
In dplyr version 1.0.0 and later, the preferred way is to use nest_by():
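library(dplyr)

newdf <- df %>%
  nest_by(fruit) %>%                                      # one row per fruit, with its rows in a list-column named data
  mutate(model = list(lm(cost ~ count, data = data))) %>% # fit one model per fruit
  # Note: from dplyr 1.1.0 on, reframe() is the recommended replacement for a multi-row summarise()
  summarise(broom::augment(model))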
Output:
# A tibble: 9 x 10
# Groups: fruit [3]
fruit cost count .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 banana 23 1 58.5 50.2 -35.5 0.667 NaN 1.00 -1.
2 banana 64 42 77.0 60.1 -13.0 0.955 Inf 10.7 -1.00
3 banana 112 12 63.5 37.8 48.5 0.378 NaN 0.304 1
4 grape 10 32 9.13 37.5 0.870 0.999 NaN 931. 1.
5 grape 10 2 37.0 26.1 -27.0 0.484 Inf 0.469 -1
6 grape 64 1 37.9 27.0 26.1 0.517 NaN 0.534 1.00
7 orange 13 13 15.5 4.34 -2.52 0.748 NaN 1.48 -1.
8 orange 23 24 18.9 2.95 4.06 0.346 Inf 0.265 1.00
9 orange 23 42 24.5 4.78 -1.54 0.906 Inf 4.81 -1.00
The precursor to nest_by() is do(), which still works but may not always be supported:
library(dplyr)
library(tidyr)  # for unnest()
library(broom)  # for augment()

newdf <- df %>%
  group_by(fruit) %>%
  do(fitmod = augment(lm(cost ~ count, data = .))) %>%
  unnest(fitmod)
The output will look like this:
# A tibble: 9 x 10
fruit cost count .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 banana 23 1 58.5 50.2 -35.5 0.667 NaN 1.00 -1.
2 banana 64 42 77.0 60.1 -13.0 0.955 Inf 10.7 -1.00
3 banana 112 12 63.5 37.8 48.5 0.378 NaN 0.304 1
4 grape 10 32 9.13 37.5 0.870 0.999 NaN 931. 1.
5 grape 10 2 37.0 26.1 -27.0 0.484 Inf 0.469 -1
6 grape 64 1 37.9 27.0 26.1 0.517 NaN 0.534 1.00
7 orange 13 13 15.5 4.34 -2.52 0.748 NaN 1.48 -1.
8 orange 23 24 18.9 2.95 4.06 0.346 Inf 0.265 1.00
9 orange 23 42 24.5 4.78 -1.54 0.906 Inf 4.81 -1.00
In both cases you get several useful new variables. The predicted value is named .fitted by default.
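If you only need the single extra column from the question, and want the rows to stay in their original order, a grouped mutate() may be enough. This is a minimal sketch rather than a drop-in answer: it rebuilds the sample data from the question, and the names result and estimated are only illustrative. It relies on lm() finding cost and count among the current group's columns when it is called inside mutate().

```r
library(dplyr)

# Sample data from the question, rebuilt here so the sketch is self-contained
df <- data.frame(
  fruit = rep(c("banana", "orange", "grape"), times = 3),
  cost  = c(23, 13, 10, 64, 23, 10, 112, 23, 64),
  count = c(1, 13, 32, 42, 24, 2, 12, 42, 1)
)

result <- df %>%
  group_by(fruit) %>%
  # Inside a grouped mutate(), cost and count refer to the current fruit only,
  # so a separate model is fitted for each fruit
  mutate(estimated = unname(fitted(lm(cost ~ count)))) %>%
  ungroup()
```

Because mutate() does not reorder or drop rows, result has the same rows as df in the same order, with estimated added as a fourth column. The trade-off is that the fitted model objects are discarded, so the nest_by() approach is the better fit if you also need coefficients or diagnostics.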
Upvotes: 2