Giuseppe Petri
Giuseppe Petri

Reputation: 640

How can apply a loess function and get predictions by groups using dplyr in r?

I have this example data set:

data.1 <-read.csv(text = "
country,year,response
Austria,2010,34378
Austria,2011,38123
Austria,2012,37126
Austria,2013,42027
Austria,2014,43832
Austria,2015,56895
Austria,2016,49791
Austria,2017,64467
Austria,2018,67620
Austria,2019,69210
Croatia,2010,56456
Croatia,2011,58896
Croatia,2012,54109
Croatia,2013,47156
Croatia,2014,47104
Croatia,2015,88867
Croatia,2016,78614
Croatia,2017,85133
Croatia,2018,77090
Croatia,2019,78330
France,2010,50939
France,2011,41571
France,2012,37367
France,2013,42999
France,2014,75789
France,2015,122529
France,2016,136518
France,2017,141829
France,2018,153850
France,2019,163800
")

I want to adjust a loess function by country and also obtain the predicted values for each year in the data frame I am providing. The loess smoothing looks like this:

ggplot(data.1, aes(x=year, y=response, color=country)) +
  geom_point(size = 3, alpha=0.3) + 
  #geom_line(aes(x=year, y=area_harvested_ha/1000), size=0.5, alpha= 1) +
  geom_smooth(method = 'loess', span=0.75, na.rm = T, se=F, size = 2)

Plot:

enter image description here

This is the code I tried to get the prediction:

data.1.with.pred <- data.1 %>% 
  group_by(country) %>% 
  arrange(country, year) %>% 
  mutate(pred.response = stats::predict(stats::loess(response ~ year, span = .75, data=.),
                         data.frame(year = seq(min(year), max(year), 1))))

I am getting the predictions in the data frame but the grouping by country is not working.

This is the plot:

ggplot(data.1.with.pred, aes(x=year, y=pred.response, color=country)) +
  geom_point(aes(x=year, y=response), size = 3, alpha=0.3) + 
  #geom_line(aes(x=year, y=area_harvested_ha/1000), size=0.5, alpha= 1) +
  geom_smooth(method = 'loess', span=0.75, na.rm = T, se=F, size = 2)

enter image description here

The problem I have is that the grouping by country is failing. I got this answer from here:

https://stackoverflow.com/a/53400029/4880334

Thanks so much for any advice.

Upvotes: 1

Views: 1838

Answers (5)

Tulse Luper
Tulse Luper

Reputation: 21

I think the grouping is overridden by "data=." inside the loess function, or at least the prediction works by omitting it:

data.1.with.pred <- data.1 %>% 
  group_by(country) %>% 
  arrange(country, year) %>% 
  mutate(pred.response = stats::predict(stats::loess(response ~ year, span = .75), 
    data.frame(year = seq(min(year), max(year), 1))))

In the original plotting aesthetics of the ggplot and geom_line functions the y-variables seem to have gone the wrong way round, as now the plotting makes a new smoothing from the already predicted values. Below, the different predictions are plotted separately:

ggplot(data.1.with.pred, aes(x = year)) +
geom_point(aes(y = response, color = country), size = 3, alpha = 0.3) + 
geom_line(aes(y = response, color = country), alpha = 0.3) +   
geom_smooth(aes(y = pred.response, color = paste(country, "
  (geom_smooth)")), method = 'loess', span = 0.75, na.rm = T, se = F) +
geom_line(aes(y = pred.response, color = paste(country, "(stats::loess)")))  

enter image description here

Upvotes: 0

Martin Gal
Martin Gal

Reputation: 16998

Similar to Henry Holm's answer:

library(purrr)

model <- data.1 %>% 
  split(f = .$country) %>% 
  map(~stats::loess(response ~ year, span = .75, data=.x))

creates a model for each country. Now you can access the fitted values via

model$Austria$fitted
#>  [1] 35195.78 36149.17 37988.25 40221.17 47372.73 51220.11 55611.14 61368.08 66159.05 70242.01
model$Croatia$fitted
#>  [1] 59333.25 53963.12 49872.81 45156.89 57061.66 76289.39 86357.84 84047.18 81245.77 76487.97
model$France$fitted
#>  [1]  53011.15  37627.29  35403.63  45360.31  78379.48 117055.05 137189.73 146822.95 155585.16 162336.60

Upvotes: 0

Dave2e
Dave2e

Reputation: 24139

The problem here is the group_by is not playing well with the mutate/predict function.

In this solution, I split the dataframe, calculated each prediction, then combined and plotted:

#split by country
sdata <-split(data.1, data.1$country)
#calculate the predicted values for each country
data.1.with.pred <- lapply(sdata, function(df){
   df$pred.response  <-stats::predict(stats::loess(response ~ year, span = .75, data=df))
   df
})

#merge back into 1 dataframe
data.1.with.pred <-dplyr::bind_rows(data.1.with.pred )

#data.1.with.pred[order(data.1.with.pred$year),]

ggplot(data.1.with.pred, aes(x=year, y=pred.response, color=country)) +
   geom_point(aes(x=year, y=response), size = 3, alpha=0.3) + 
   #geom_line(aes(x=year, y=area_harvested_ha/1000), size=0.5, alpha= 1) +
   geom_smooth(method = 'loess', span=0.75, na.rm = T, se=F, size = 2)

enter image description here

Upvotes: 2

Henry Holm
Henry Holm

Reputation: 545

Use the loess function to make a model of the subset of your data like this:

#use a loess model on a subset of the data (France)
    model <- loess(formula = response ~ year,data = subset(data.1,country == "France"))

#plot
    ggplot() +
      geom_point(data = data.1,
                 mapping = aes(x=year, y=response, color=country),size = 3, alpha=0.3) + 
      geom_line(aes(model$x,model$fitted)) +
      geom_smooth(method = 'loess', span=0.75, na.rm = T, se=F, size = 2)

Fitted values are in model$fitted

Upvotes: 1

Christopher Belanger
Christopher Belanger

Reputation: 631

If you want to get loess predictions for each country, you might want to use a nest()ed data frame. This will let you set up a column that contains data frames for country-specific data, and then run loess() and predict() on those individual data frames, then unnest() to bring the results back into a standard format.

Here's some code that nests your data, runs the analysis on each country, then pulls it back out to a regular data frame:

library(tidyverse)

data.1.with.pred <- data.1 %>% 
  group_by(country) %>% 
  arrange(country, year) %>% 
  nest() %>%
  mutate(pred.response = purrr::map(data, function(x)stats::loess(response~year, span= 0.75, data = x) %>%
                             stats::predict(data.frame(year = seq(min(x$year), max(x$year), 1))))) %>%
  unnest(cols = c(data, pred.response))

data.1.with.pred %>%
  ggplot() +
  geom_point(aes(x = year, y = response, colour = country)) +
  geom_line(aes(x = year,y=pred.response, colour = country))

The resulting data frame has annual loess predictions for each country, as opposed to for all countries together, and the plot looks like this: a ggplot

Is this what you were trying to do?

Upvotes: 3

Related Questions