Reputation: 1083
df <- structure(list(ID = c("1", "2", "3", "4", "5", "6"), Column1 = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Column2 = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007",
"2014, 2011", "2007"), `Cut-Off` = c("2011", "2015", "2015",
"2005", "2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA
), `2006` = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), `2007` = c("15", NA, "18", NA,
"30, 18", NA), `2008` = c("16", NA, NA, "30, 27", "18, 30", NA
), `2009` = c("15", NA, NA, "20", "30, 18", NA), `2010` = c(NA,
NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
`2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15",
NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA
), `2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_)), .Names = c("ID", "Column1", "Column2", "Cut-Off",
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012",
"2013", "2014", "2015", "2016"), row.names = c(NA, 6L), class = "data.frame")
Given the data frame above. What I would like R to do, is to look at the cut-off year (column 4) and then create 2 new column at the end of the data frame, where one column has the total number of unique "identifiers" inside each element for each year before the cut-off year and the other column has the total numbers after the cut-off year. The identifiers in the cut-off year columns should not be included.
The data frame below shows the desired output.
For example, in the first row, the cut-off year is 2011, and the years 2007, 2008, and 2009 which are before the cut-off year have the identifiers 15, 16, and 15 respectively. So the unique number of identifiers is 15 and 16 (the second 15 is removed) and then it counts "2" in the "Before" column. After the cut-off year, only 2013 has an identifier so it counts "1" in the "After" column.
If there are 2 or more identifiers in one element (for instance in row 4 and 5 where it says "30, 27" or "30, 18") then it should still be treated as identifiers separated by the comma.
df_solution <- structure(list(ID = c("1", "2", "3", "4", "5", "6"), Column1 = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Column2 = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007",
"2014, 2011", "2007"), `Cut-Off` = c("2011", "2015", "2015",
"2005", "2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA
), `2006` = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), `2007` = c("15", NA, "18", NA,
"30, 18", NA), `2008` = c("16", NA, NA, "30, 27", "18, 30", NA
), `2009` = c("15", NA, NA, "20", "30, 18", NA), `2010` = c(NA,
NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
`2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15",
NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA
), `2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), Before = c(2, 0, 2, 0, 2, 0), After = c(1,
0, 0, 3, 1, 1)), .Names = c("ID", "Column1", "Column2", "Cut-Off",
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012",
"2013", "2014", "2015", "2016", "Before", "After"), row.names = c(NA,
6L), class = "data.frame")
Upvotes: 2
Views: 65
Reputation: 16121
library(tidyverse)
df %>%
select(-Column1, - Column2) %>% # remove those columns
gather(year,value,-ID, -`Cut-Off`) %>% # reshape data
na.omit() %>% # remove rows with NA
separate_rows(value) %>% # split values (using commas)
group_by(ID, `Cut-Off`) %>% # for each ID and cut-off
summarise(Before = n_distinct(value[as.numeric(`Cut-Off`) > as.numeric(year)]), # count distinct values where cut-off is after the dates
After = n_distinct(value[as.numeric(`Cut-Off`) < as.numeric(year)])) %>% # count distinct values where cut-off is before the dates
ungroup() %>% # forget the grouping
select(-`Cut-Off`) %>% # remove cut-off column
right_join(df, by="ID") %>% # join back original dataset
mutate_at(vars(Before,After), ~coalesce(.,0L)) # replace NAs with 0 in those two columns
# # A tibble: 6 x 18
# ID Before After Column1 Column2 `Cut-Off` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012`
# <chr> <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 1 NA 2011 2011 NA NA 15 16 15 NA NA NA
# 2 2 0 0 NA 2015 2015 NA NA NA NA NA NA NA NA
# 3 3 2 0 NA 2015 2015 NA NA 18 NA NA NA NA NA
# 4 4 0 3 NA 2006, 2006,~ 2005 30 NA NA 30, 27 20 30, 20 NA 20, 30
# 5 5 2 1 NA 2014, 2011 2011 18 NA 30, 18 18, 30 30, 18 NA NA NA
# 6 6 0 1 NA 2007 2007 NA NA NA NA NA NA NA 26
# # ... with 4 more variables: `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>
Upvotes: 2