Reputation: 89
I have a df that looks like this:
ID Year
5 2010
5 2011
5 2014
3 2013
3 2014
10 2013
1 2010
1 2012
1 2014
...
The df contains the years 2009-2019, and is filtered on individuals living in a one particular town, that are 18-64 years old at that particular year.
For every year I need to keep only individuals that have moved into this town that particular year. So for example, I need to keep the difference between the population at year 2010 minus the population at year 2009. I also need to do this for every year (so for example, some people move out of town for a couple of years and then return - ID 5 is an example of this). In the end, I want one df for every year 2010-2019, so ten dfs that contain only individuals that moved into town that particular year.
I have played around with group_by()
and left_join()
, but haven't managed to succeed. There must be a simple solution, but I haven't been able to find one yet.
Upvotes: 0
Views: 53
Reputation: 36
You can use the setdiff
function to perform set(A) - set(B) operation. Split your data into dataframes by year, and then loop through them, finding the new joiners.
Example code:
library(dplyr)
set.seed(123)
df <- tibble(
id = c(1, 2, 3, 4, 5, # first year
1, 2, 3, 5, 6, 7, # 4 moves out, 6,7 move in
2, 3, 4, 6, 7, 8), # 1,5 moves out, 4,8 move in
year = c(rep(2009, 5),
rep(2010, 6),
rep(2011, 6)),
age = sample(18:64, size = 17) # extra column
)
# split into list of dataframes by year
df_by_year <- split(df, df$year)
# create a list to contain the 2 df (total years 3 - 1)
df_list <- vector("list", 2)
for(i in 1:length(df_list)){
# determine incoming new people
new_joinees <- setdiff(df_by_year[[i+1]]$id, df_by_year[[i]]$id)
# filter for above IDs
df_list[[i]] <- dplyr::filter(df_by_year[[i+1]], id %in% new_joinees)
}
Upvotes: 1