John Thomas

Reputation: 1105

Create model on different filters of data using dplyr

So I have data like this:

fruit   cost  count 
banana   23       1
orange   13      13
grape    10      32
banana   64      42
orange   23      24
grape    10       2
banana   112     12
orange   23      42
grape    64       1

And basically I would like to build a simple linear regression model for each fruit.

So a regression model for banana would only use:

banana_df

fruit    cost count
banana   23       1
banana   64      42
banana   112     12

This is the equation:

banana_eq <- lm(cost ~ count, data = banana_df)

Then I get a column of predictions

banana_df$estimated <- predict(banana_eq, newdata = banana_df)

So I want to do this for every fruit. The final df should have the same number of rows as the original, plus a new variable holding the regression estimates.

Again, the critical part is that each fruit gets its own regression: 5 fruits mean 5 regression models. A tidyverse solution is preferred.

FINAL DF:

fruit   cost  count  estimated
banana   23       1         XX
orange   13      13         YY
grape    10      32         ZZ
banana   64      42         XX
orange   23      24         YY
grape    10       2         ZZ
banana   112     12         XX
orange   23      42         YY
grape    64       1         ZZ

XX, YY, and ZZ represent the estimates produced by that fruit's regression model.

Upvotes: 0

Views: 205

Answers (1)

Duck

Reputation: 39613

In dplyr >= 1.0.0, the preferred way is to use nest_by():

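library(dplyr)

# nest_by() gives one row per fruit with a list-column `data`; fit one lm per row
# In dplyr >= 1.1.0, reframe() is the replacement for a multi-row summarise()
newdf <- df %>%
  nest_by(fruit) %>%
  mutate(model = list(lm(cost ~ count, data = data))) %>%
  summarise(broom::augment(model))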

Output:

# A tibble: 9 x 10
# Groups:   fruit [3]
  fruit   cost count .fitted .se.fit  .resid  .hat .sigma .cooksd .std.resid
  <chr>  <int> <int>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>   <dbl>      <dbl>
1 banana    23     1   58.5    50.2  -35.5   0.667    NaN   1.00       -1.  
2 banana    64    42   77.0    60.1  -13.0   0.955    Inf  10.7        -1.00
3 banana   112    12   63.5    37.8   48.5   0.378    NaN   0.304       1   
4 grape     10    32    9.13   37.5    0.870 0.999    NaN 931.          1.  
5 grape     10     2   37.0    26.1  -27.0   0.484    Inf   0.469      -1   
6 grape     64     1   37.9    27.0   26.1   0.517    NaN   0.534       1.00
7 orange    13    13   15.5     4.34  -2.52  0.748    NaN   1.48       -1.  
8 orange    23    24   18.9     2.95   4.06  0.346    Inf   0.265       1.00
9 orange    23    42   24.5     4.78  -1.54  0.906    Inf   4.81       -1.00

The precursor to nest_by() is do(), which is superseded: it still works, but may not always be supported:

library(dplyr)
library(tidyr)  # provides unnest()
library(broom)

# Fit one model per fruit, then unnest the augmented results
newdf <- df %>%
  group_by(fruit) %>%
  do(fitmod = augment(lm(cost ~ count, data = .))) %>%
  unnest(fitmod)

The output will look like this:

# A tibble: 9 x 10
  fruit   cost count .fitted .se.fit  .resid  .hat .sigma .cooksd .std.resid
  <chr>  <int> <int>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>   <dbl>      <dbl>
1 banana    23     1   58.5    50.2  -35.5   0.667    NaN   1.00       -1.  
2 banana    64    42   77.0    60.1  -13.0   0.955    Inf  10.7        -1.00
3 banana   112    12   63.5    37.8   48.5   0.378    NaN   0.304       1   
4 grape     10    32    9.13   37.5    0.870 0.999    NaN 931.          1.  
5 grape     10     2   37.0    26.1  -27.0   0.484    Inf   0.469      -1   
6 grape     64     1   37.9    27.0   26.1   0.517    NaN   0.534       1.00
7 orange    13    13   15.5     4.34  -2.52  0.748    NaN   1.48       -1.  
8 orange    23    24   18.9     2.95   4.06  0.346    Inf   0.265       1.00
9 orange    23    42   24.5     4.78  -1.54  0.906    Inf   4.81       -1.00

In both cases you get several useful new variables. The predicted value is named .fitted by default.
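
If only the fitted values are needed, in the original row order as the question asks, a grouped mutate() may be enough. This is a minimal sketch rather than part of the approaches above. It assumes cost and count have no missing values (otherwise fitted() could return fewer values than there are rows). The column name estimated is taken from the question, and df is rebuilt here from the question's sample data.

```r
library(dplyr)

# Sample data copied from the question, in the same row order
df <- data.frame(
  fruit = rep(c("banana", "orange", "grape"), times = 3),
  cost  = c(23, 13, 10, 64, 23, 10, 112, 23, 64),
  count = c(1, 13, 32, 42, 24, 2, 12, 42, 1)
)

# Inside a grouped mutate(), cost and count refer to the current fruit's rows,
# so lm() fits one model per fruit; fitted() returns one value per row
result <- df %>%
  group_by(fruit) %>%
  mutate(estimated = unname(fitted(lm(cost ~ count)))) %>%
  ungroup()
```

Unlike the nested approaches, this keeps the rows in their original order and adds only the one new column, at the cost of discarding the model objects and the other diagnostics.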

Upvotes: 2
