Varun
Varun

Reputation: 1321

R Capturing regression slopes by group in a dataframe

My dataframe consists of scores for different questions asked in a survey, over 3 fiscal years (FY13, FY14 & FY15). The results are presented by Region.

Here's what a sample of the actual dataframe looks like, where we have two questions per region, asked in different years.

testdf=data.frame(FY=c("FY13","FY14","FY15","FY14","FY15","FY13","FY14","FY15","FY13","FY15","FY13","FY14","FY15","FY13","FY14","FY15"),
              Region=c(rep("AFRICA",5),rep("ASIA",5),rep("AMERICA",6)),
              QST=c(rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",3)),
              Very.Satisfied=runif(16,min = 0, max=1),
              Total.Very.Satisfied=floor(runif(16,min=10,max=120)))

My Objective

For each region, my objective is to identify which question experienced the most significant upward evolution across this 3 year time frame. In order to measure significant upward movements, I have decided to use the slope of regression as a parameter.

The question with the most significant upward evolution within a region over the 3 years time frame will be the one with the steepest positive slope.

Using this logic, I have decided to do the following -

1) For each combination of Region and QST, I run the lm function.

2) I extract the slope for each combination, and store it as a separate variable. Then for each region I filter out the question with the maximum slope value.

My Attempt

Here is my attempt at solving this.

test_final=testdf %>%   
group_by(Region,QST) %>% 
map(~lm(FY ~ Very.Satisfied, data = .)) %>%
map_df(tidy) %>%
filter(term == 'circumference') %>%
select(estimate) %>% 
summarise(Value = max(estimate))

However when I run this I get an error message saying that object FY was not found.

Additional requirement

Also I'd like this to work only for questions that have at least 2 consecutive years of data for comparison. But I'm unable to figure out how to factor this condition into my code.

Any help with this would be greatly appreciated.

Upvotes: 1

Views: 320

Answers (2)

gfgm
gfgm

Reputation: 3647

Here is a similar version with filtering by group size/contiguity (had written it by the time you posted so figured I might as well go ahead).

library(tidyverse)
set.seed(42)
testdf=data.frame(FY=c("FY13","FY14","FY15","FY14","FY15","FY13","FY14","FY15","FY13","FY15","FY13","FY14","FY15","FY13","FY14","FY15"),
                  Region=c(rep("AFRICA",5),rep("ASIA",5),rep("AMERICA",6)),
                  QST=c(rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",3)),
                  Very.Satisfied=runif(16,min = 0, max=1),
                  Total.Very.Satisfied=floor(runif(16,min=10,max=120)))

test_final <- testdf %>%   
  group_by(Region,QST) %>% # group by region
  mutate(numdate = as.numeric(str_remove(FY, "FY"))) %>% 
  filter(n() >= 2 & max(diff(numdate)) < 2) %>% # filter out singleton groups
  mutate(slopes = coef(lm(Very.Satisfied~numdate))[2])
test_final %>% select(Region, QST, slopes)
#> # A tibble: 14 x 3
#> # Groups:   Region, QST [5]
#>    Region  QST   slopes
#>    <fct>   <fct>  <dbl>
#>  1 AFRICA  Q2    -0.314
#>  2 AFRICA  Q2    -0.314
#>  3 AFRICA  Q2    -0.314
#>  4 AFRICA  Q5    -0.189
#>  5 AFRICA  Q5    -0.189
#>  6 ASIA    Q2    -0.192
#>  7 ASIA    Q2    -0.192
#>  8 ASIA    Q2    -0.192
#>  9 AMERICA Q2     0.238
#> 10 AMERICA Q2     0.238
#> 11 AMERICA Q2     0.238
#> 12 AMERICA Q5     0.342
#> 13 AMERICA Q5     0.342
#> 14 AMERICA Q5     0.342

test_final %>% group_by(Region) %>% 
  summarise(Value = max(slopes),
            Top_Question = QST[which.max(slopes)])
#> # A tibble: 3 x 3
#>   Region   Value Top_Question
#>   <fct>    <dbl> <fct>       
#> 1 AFRICA  -0.189 Q5          
#> 2 AMERICA  0.342 Q5          
#> 3 ASIA    -0.192 Q2

Created on 2019-01-21 by the reprex package (v0.2.1)

Upvotes: 3

A. S. K.
A. S. K.

Reputation: 2816

This doesn't do the "at least two consecutive years" part, but it does the "get the question with the largest slope" part:

library(dplyr)
test_final = testdf %>%
  mutate(FY.num = as.numeric(gsub("FY", "", FY))) %>%
  group_by(Region, QST) %>%
  mutate(lm_slope = lm(Very.Satisfied ~ FY.num)$coefficients[["FY.num"]]) %>%
  ungroup() %>%
  group_by(Region) %>%
  filter(lm_slope == max(lm_slope))

Upvotes: 4

Related Questions