Reputation: 169
I have data for every census tract in the country at three time points (2000, 2013, 2019) that is grouped by CBSA. I'm trying to create a new variable called cont_chg_pedu_colplus that is the difference between the 2013 and 2000 value for pedu_colplus. So, in my example below, I want to create a new column called cont_chg_pedu_colplus that returns the value of 3.0 (14.6 - 11.6). Ideally, each group of tracts would have the same value, since I'm only interested in the difference between time 1 and time 2.
tractid year CBSA_name pedu_colplus
<chr> <dbl> <chr> <dbl>
1 48059030101 2000 Abilene, TX 11.6
2 48059030101 2013 Abilene, TX 14.6
3 48059030101 2019 Abilene, TX 20.6
4 48059030102 2000 Abilene, TX 11.6
5 48059030102 2013 Abilene, TX 14.2
6 48059030102 2019 Abilene, TX 20.2
Below is the code I have so far. It throws the following error, I think because I'm subsetting on just one year (37 rows instead of the 111 in the dataset). I'd prefer not to make my data wide, because I've got a bunch of other data manipulations I have to. I couldn't get lag to work.
gent_vars_prelim <- outcome_data %>%
mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000], na.rm = TRUE) %>%
glimpse()
Problem with
mutate()
inputcont_chg_pedu_colplus
. x Inputcont_chg_pedu_colplus
can't be recycled to size 37. ℹ Inputcont_chg_pedu_colplus
ispedu_colplus[year == 2013] - pedu_colplus[year == 2000]
. ℹ Inputcont_chg_pedu_colplus
must be size 37 or 1, not 0. ℹ The error occurred in group 1: CBSA_name = "Abilene, TX", year = 2000
Any thoughts? Thanks.
Upvotes: 0
Views: 1119
Reputation: 275
I'll assume that for each unique pair of tractid
and CBSA_name
, there are up to 3 entries for year
(possible values 2000, 2013, or 2019) and no two years are the same for a given pair of tractid
and CBSA_name
.
First, we'll group the values in the data frame by tractid
and CBSA_name
. Each group will have up to 3 rows, one for each year. We do this with dplyr::group_by(tractid, CBSA_name)
.
Next, we'll force the group to have all 3 years. We do this with tidyr::complete(year = c(2000, 2013, 2019))
as you suggested in the comments. (This is better than my comment using filter(n() == 3)
, because we actually wouldn't care if only 2019 were missing, and we are able to preserve incomplete groups.)
Then, we can compute the result you're interested in: dplyr::mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000])
. We just need to dplyr::ungroup()
after this and we're done.
Final code:
gent_vars_prelim <- outcome_data %>%
dplyr::group_by(tractid, CBSA_name) %>%
tidyr::complete(year = c(2000, 2013, 2019)) %>%
dplyr::mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000]) %>%
dplyr::ungroup() %>%
glimpse()
Test case:
outcome_data <- data.frame(tractid = c(48059030101, 48059030101, 48059030101, 48059030101, 48059030101, 48059030101, 48059030102, 48059030102, 48059030102, 48059030103),
year = c(2000, 2013, 2019, 2000, 2013, 2019, 2000, 2013, 2019, 2000),
CBSA_name = c("Abilene, TX", "Abilene, TX", "Abilene, TX", "Austin, TX", "Austin, TX", "Austin, TX", "Abilene, TX", "Abilene, TX", "Abilene, TX", "Abilene, TX"),
pedu_colplus = c(11.6, 14.6, 20.6, 8.4, 9.0, 9.6, 11.6, 14.2, 20.2, 4.0))
Result:
> tibble(gent_vars_prelim)
# A tibble: 12 x 1
gent_vars_prelim$tractid $CBSA_name $year $pedu_colplus $cont_chg_pedu_colplus
<dbl> <fct> <dbl> <dbl> <dbl>
1 48059030101 Abilene, TX 2000 11.6 3
2 48059030101 Abilene, TX 2013 14.6 3
3 48059030101 Abilene, TX 2019 20.6 3
4 48059030101 Austin, TX 2000 8.4 0.600
5 48059030101 Austin, TX 2013 9 0.600
6 48059030101 Austin, TX 2019 9.6 0.600
7 48059030102 Abilene, TX 2000 11.6 2.60
8 48059030102 Abilene, TX 2013 14.2 2.60
9 48059030102 Abilene, TX 2019 20.2 2.60
10 48059030103 Abilene, TX 2000 4 NA
11 48059030103 Abilene, TX 2013 NA NA
12 48059030103 Abilene, TX 2019 NA NA
Upvotes: 4