kaseyzapatka
kaseyzapatka

Reputation: 169

new variable by subtracting rows based on condition in grouped data

I have data for every census tract in the country at three time points (2000, 2013, 2019) that is grouped by CBSA. I'm trying to create a new variable called cont_chg_pedu_colplus that is the difference between the 2013 and 2000 value for pedu_colplus. So, in my example below, I want to create a new column called cont_chg_pedu_colplus that returns the value of 3.0 (14.6 - 11.6). Ideally, each group of tracts would have the same value, since I'm only interested in the difference between time 1 and time 2.

  tractid      year CBSA_name   pedu_colplus
  <chr>       <dbl> <chr>              <dbl>
1 48059030101  2000 Abilene, TX         11.6
2 48059030101  2013 Abilene, TX         14.6
3 48059030101  2019 Abilene, TX         20.6
4 48059030102  2000 Abilene, TX         11.6
5 48059030102  2013 Abilene, TX         14.2
6 48059030102  2019 Abilene, TX         20.2

Below is the code I have so far. It throws the following error, I think because I'm subsetting on just one year (37 rows instead of the 111 in the dataset). I'd prefer not to make my data wide, because I've got a bunch of other data manipulations I have to. I couldn't get lag to work.


gent_vars_prelim <- outcome_data %>% 
    
  mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000], na.rm = TRUE) %>%
                
  glimpse()

Problem with mutate() input cont_chg_pedu_colplus. x Input cont_chg_pedu_colplus can't be recycled to size 37. ℹ Input cont_chg_pedu_colplus is pedu_colplus[year == 2013] - pedu_colplus[year == 2000]. ℹ Input cont_chg_pedu_colplus must be size 37 or 1, not 0. ℹ The error occurred in group 1: CBSA_name = "Abilene, TX", year = 2000

Any thoughts? Thanks.

Upvotes: 0

Views: 1119

Answers (1)

Dillon
Dillon

Reputation: 275

I'll assume that for each unique pair of tractid and CBSA_name, there are up to 3 entries for year (possible values 2000, 2013, or 2019) and no two years are the same for a given pair of tractid and CBSA_name.

First, we'll group the values in the data frame by tractid and CBSA_name. Each group will have up to 3 rows, one for each year. We do this with dplyr::group_by(tractid, CBSA_name).

Next, we'll force the group to have all 3 years. We do this with tidyr::complete(year = c(2000, 2013, 2019)) as you suggested in the comments. (This is better than my comment using filter(n() == 3), because we actually wouldn't care if only 2019 were missing, and we are able to preserve incomplete groups.)

Then, we can compute the result you're interested in: dplyr::mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000]). We just need to dplyr::ungroup() after this and we're done.

Final code:

gent_vars_prelim <- outcome_data %>%
  dplyr::group_by(tractid, CBSA_name) %>%
  tidyr::complete(year = c(2000, 2013, 2019)) %>%
  dplyr::mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000]) %>%
  dplyr::ungroup() %>%
  glimpse()

Test case:

outcome_data <- data.frame(tractid = c(48059030101, 48059030101, 48059030101, 48059030101, 48059030101, 48059030101, 48059030102, 48059030102, 48059030102, 48059030103),
                           year = c(2000, 2013, 2019, 2000, 2013, 2019, 2000, 2013, 2019, 2000),
                           CBSA_name = c("Abilene, TX", "Abilene, TX", "Abilene, TX", "Austin, TX", "Austin, TX", "Austin, TX", "Abilene, TX", "Abilene, TX", "Abilene, TX", "Abilene, TX"),
                           pedu_colplus = c(11.6, 14.6, 20.6, 8.4, 9.0, 9.6, 11.6, 14.2, 20.2, 4.0))

Result:

> tibble(gent_vars_prelim)
# A tibble: 12 x 1
   gent_vars_prelim$tractid $CBSA_name  $year $pedu_colplus $cont_chg_pedu_colplus
                      <dbl> <fct>       <dbl>         <dbl>                  <dbl>
 1              48059030101 Abilene, TX  2000          11.6                  3    
 2              48059030101 Abilene, TX  2013          14.6                  3    
 3              48059030101 Abilene, TX  2019          20.6                  3    
 4              48059030101 Austin, TX   2000           8.4                  0.600
 5              48059030101 Austin, TX   2013           9                    0.600
 6              48059030101 Austin, TX   2019           9.6                  0.600
 7              48059030102 Abilene, TX  2000          11.6                  2.60 
 8              48059030102 Abilene, TX  2013          14.2                  2.60 
 9              48059030102 Abilene, TX  2019          20.2                  2.60 
10              48059030103 Abilene, TX  2000           4                   NA    
11              48059030103 Abilene, TX  2013          NA                   NA    
12              48059030103 Abilene, TX  2019          NA                   NA    

Upvotes: 4

Related Questions