Using conditions in group_by()/summarize() loop

Question

I have a dataframe that looks something like this (I have a lot more years and variables):

Name    State2014     State2015  State2016  Tuition2014   Tuition2015  Tuition2016  StateGrants2014
Jared   CA            CA         MA         22430         23060        40650        5000
Beth    CA            CA         CA         36400         37050        37180        4200
Steven  MA            MA         MA         18010         18250        18720        NA
Lary    MA            CA         MA         24080         30800        24600        6600
Tom     MA            OR         OR         40450         15800        16040        NA
Alfred  OR            OR         OR         23570         23680        23750        3500
Cathy   OR            OR         OR         32070         32070        33040        4700

My objective (in this example) is to get the mean tuition for each state, and the sum of state grants for each state. My thought was to subset the data by year:

State2014     Tuition2014   StateGrants2014
CA            22430         5000
CA            36400         4200
MA            18010         NA
MA            24080         6600
MA            40450         NA
OR            23570         3500
OR            32070         4700

State2015  Tuition2015  
CA         23060        
CA         37050        
MA         18250        
CA         30800        
OR         15800        
OR         23680        
OR         32070       

State2016  Tuition2016  
MA         40650        
CA         37180        
MA         18720        
MA         24600        
OR         16040        
OR         23750        
OR         33040

Then I would group_by state and summarize (and save each as a separate df) to get the following:

State2014     Tuition2014   StateGrants2014
CA            29415         9200
MA            27513         6600
OR            27820         8200

State2015  Tuition2015  
CA         30303        
MA         18250        
OR         23850    

State2016  Tuition2016  
CA         37180        
MA         27990        
OR         24277

Then I would merge them by state. Here is my code:

years = c(2014,2015,2016)
for (i in seq_along(years)) {
  #grab the variables from a certain year and save as a new df.
  df_year <- df[, grep(paste(years[[i]],"$",sep=""), colnames(df))]

  #Take off the year from each variable name (to make it easier to summarize)
  names(df_year) <- gsub(years[[i]], "", names(df_year), fixed = TRUE)

  df_year <- df_year %>%
    group_by(State) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE),
              # This part of the code does not work. In this example, I only want to have this part if the year is 2016.
              if (years[[i]] == 2016)
                {StateGrants = mean(StateGrants, na.rm = TRUE)})

  #rename df_year to df####
  assign(paste("df",years[[i]],sep=''),df_year)
}

I have about 50 years of data and a good number of variables, so I wanted to use a loop. My question is: how do I add a conditional statement (summarize certain variables depending on the year) inside the group_by()/summarize() call? Thanks!

Edit: I realize that I could take the if {} out of the summarize() call and do something like:

  if (years[[i]] == 2016) {
    df_year <- df_year %>%
      group_by(State) %>%
      summarize(Tuition = mean(Tuition, na.rm = TRUE),
                StateGrants = mean(StateGrants, na.rm = TRUE))

    #rename df_year to df####
    assign(paste("df", years[[i]], sep = ""), df_year)
  } else {
    df_year <- df_year %>%
      group_by(State) %>%
      summarize(Tuition = mean(Tuition, na.rm = TRUE))

    #rename df_year to df####
    assign(paste("df", years[[i]], sep = ""), df_year)
  }
}

But there are so many combinations of variables that using a for loop this way would not be very efficient or useful.

coffeinjunky · Accepted Answer

This is so much easier with tidy data, so let me show you how to tidy up your data. See http://r4ds.had.co.nz/tidy-data.html.

library(tidyr)
library(dplyr)

df <- gather(df, key, value, -Name) %>% 
  # separate years from the variables
  separate(key, c("var", "year"), sep = -5) %>% 
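  # the above line splits up e.g. State2014 into State and 2014.
  # It does so by splitting off the four-digit year at the end of the
  # name. Note: tidyr 0.8.2 changed how negative `sep` positions are
  # counted, so on newer versions this likely needs to be sep = -4.
  # Please check that this works for your other variables
  # in case your naming conventions are inconsistent.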
  spread(var, value) %>% 
  # turn numbers back to numeric
  mutate_at(.vars = c("Tuition", "StateGrants"), as.numeric) %>% 
  gather(var, val, -Name, -year, -State) %>% 
  # group by the variables of interest. Note that `var` here 
  # refers to Tuition and StateGrants. If you have more variables,
  # they will be included here as well. If you want to exclude more
  # variables from being included here in `var`, add more "-colName" 
  # entries in the `gather` statement above
  group_by(year, State, var) %>% 
  # summarize:
  summarise(mean_values = mean(val))

This gives you:

Source: local data frame [18 x 4]
Groups: year, State [?]
    year State         var mean_values
   <chr> <chr>       <chr>       <dbl>
1   2014    CA StateGrants     4600.00
2   2014    CA     Tuition    29415.00
3   2014    MA StateGrants          NA
4   2014    MA     Tuition    27513.33
5   2014    OR StateGrants     4100.00
6   2014    OR     Tuition    27820.00
7   2015    CA StateGrants          NA
8   2015    CA     Tuition    30303.33
9   2015    MA StateGrants          NA
10  2015    MA     Tuition    18250.00
11  2015    OR StateGrants          NA
12  2015    OR     Tuition    23850.00
13  2016    CA StateGrants          NA
14  2016    CA     Tuition    37180.00
15  2016    MA StateGrants          NA
16  2016    MA     Tuition    27990.00
17  2016    OR StateGrants          NA
18  2016    OR     Tuition    24276.67

If you don't like the shape of this, you can e.g. add %>% spread(var, mean_values) after the summarise call to get the means for Tuition and StateGrants in separate columns.
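As a small illustration of that reshaping step, here is a sketch that applies spread() to a few rows copied from the output above. The names means_long and means_wide are made up for this example, and only the CA rows for 2014 and 2015 are included to keep it short.

```r
library(tidyr)

# A few rows of the long summary shown above (hypothetical object name)
means_long <- data.frame(
  year = c("2014", "2014", "2015", "2015"),
  State = c("CA", "CA", "CA", "CA"),
  var = c("StateGrants", "Tuition", "StateGrants", "Tuition"),
  mean_values = c(4600, 29415, NA, 30303.33),
  stringsAsFactors = FALSE
)

# One column per variable, one row per year/State combination
means_wide <- spread(means_long, var, mean_values)
print(means_wide)
```

The result should have the columns year, State, StateGrants and Tuition, with one row per year.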

If you want to compute different functions for Tuition and Grants (e.g. the mean of Tuition and the sum of grants), you could do the following:

# start again from the original wide df here, not the summary created above
df <- gather(df, key, value, -Name) %>% 
  separate(key, c("var", "year"), sep = -5) %>% 
  spread(var, value) %>% 
  mutate_at(.vars = c("Tuition", "StateGrants"), as.numeric) %>% 
  group_by(year, State) %>% 
  summarise(Grant_Sum = sum(StateGrants, na.rm = TRUE), Tuition_Mean = mean(Tuition))

This gives you:

Source: local data frame [9 x 4]
Groups: year [?]

   year State Grant_Sum Tuition_Mean
  <chr> <chr>     <dbl>        <dbl>
1  2014    CA      9200     29415.00
2  2014    MA      6600     27513.33
3  2014    OR      8200     27820.00
4  2015    CA         0     30303.33
5  2015    MA         0     18250.00
6  2015    OR         0     23850.00
7  2016    CA         0     37180.00
8  2016    MA         0     27990.00
9  2016    OR         0     24276.67
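For readers on a recent tidyr, the same summary can probably be reached with pivot_longer(), which is a newer function and not what the code above uses. This is only a sketch: it assumes every column except Name ends in a four-digit year, and the names long and result are made up for the example. The data frame is the sample from the question.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (wide format, one column per variable and year)
df <- data.frame(
  Name = c("Jared", "Beth", "Steven", "Lary", "Tom", "Alfred", "Cathy"),
  State2014 = c("CA", "CA", "MA", "MA", "MA", "OR", "OR"),
  State2015 = c("CA", "CA", "MA", "CA", "OR", "OR", "OR"),
  State2016 = c("MA", "CA", "MA", "MA", "OR", "OR", "OR"),
  Tuition2014 = c(22430, 36400, 18010, 24080, 40450, 23570, 32070),
  Tuition2015 = c(23060, 37050, 18250, 30800, 15800, 23680, 32070),
  Tuition2016 = c(40650, 37180, 18720, 24600, 16040, 23750, 33040),
  StateGrants2014 = c(5000, 4200, NA, 6600, NA, 3500, 4700),
  stringsAsFactors = FALSE
)

# Split each column name into a variable part and a year part.
# ".value" keeps State, Tuition and StateGrants as separate columns;
# variable/year combinations that do not exist are filled with NA.
long <- pivot_longer(
  df,
  cols = -Name,
  names_to = c(".value", "year"),
  names_pattern = "^(\\D+)(\\d{4})$"
)

# Different summary function per variable, one row per year and State
result <- long %>%
  group_by(year, State) %>%
  summarise(
    Grant_Sum = sum(StateGrants, na.rm = TRUE),
    Tuition_Mean = mean(Tuition),
    .groups = "drop"
  )

print(result)
```

Because the column types are kept during the reshape, no conversion back to numeric is needed in this variant.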

Note that I used sum() with na.rm = TRUE here, which returns 0 if all elements are NA. Make sure this makes sense for your use case.
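If a 0 for years without any grant data would be misleading, one option is a small wrapper that keeps NA in that case. sum_or_na is a hypothetical helper written for this illustration, not a function from dplyr or base R.

```r
# Hypothetical helper: a sum that stays NA when there is nothing to add up
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

sum(c(NA, NA), na.rm = TRUE)   # 0
sum_or_na(c(NA, NA))           # NA
sum_or_na(c(5000, NA, 4200))   # 9200
```

It could be used in place of sum(StateGrants, na.rm = TRUE) inside summarise().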

Also, just to mention it, to get your individual data.frames that you asked for, you can use filter(year == 2014) etc, as in df_2014 <- filter(df, year == 2014).
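With about 50 years of data, a list of data frames keyed by year may be easier to handle than 50 separately named objects. The sketch below uses base R's split() on a few rows taken from the summary above. summary_df and by_year are names invented for the example.

```r
# A few rows of the per-year summary (hypothetical object name)
summary_df <- data.frame(
  year = c("2014", "2014", "2015", "2016"),
  State = c("CA", "MA", "CA", "CA"),
  Tuition_Mean = c(29415, 27513.33, 30303.33, 37180),
  stringsAsFactors = FALSE
)

# One data frame per year, stored in a named list
by_year <- split(summary_df, summary_df$year)

names(by_year)        # the years present in the data
by_year[["2014"]]     # the rows for 2014 only
```

Each element can then be accessed by year, which avoids the assign() calls from the original loop.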
