Making R count the number of strings inside an element

Question

df <- structure(list(ID = c("1", "2", "3", "4", "5", "6"), Column1 = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Column2 = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007", 
"2014, 2011", "2007"), `Cut-Off` = c("2011", "2015", "2015", 
"2005", "2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA
), `2006` = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), `2007` = c("15", NA, "18", NA, 
"30, 18", NA), `2008` = c("16", NA, NA, "30, 27", "18, 30", NA
), `2009` = c("15", NA, NA, "20", "30, 18", NA), `2010` = c(NA, 
NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    `2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15", 
    NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA
    ), `2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_)), .Names = c("ID", "Column1", "Column2", "Cut-Off", 
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", 
"2013", "2014", "2015", "2016"), row.names = c(NA, 6L), class = "data.frame")

Given the data frame above. What I would like R to do, is to look at the cut-off year (column 4) and then create 2 new column at the end of the data frame, where one column has the total number of unique "identifiers" inside each element for each year before the cut-off year and the other column has the total numbers after the cut-off year. The identifiers in the cut-off year columns should not be included.

The data frame below shows the desired output.

For example, in the first row, the cut-off year is 2011, and the years 2007, 2008, and 2009 which are before the cut-off year have the identifiers 15, 16, and 15 respectively. So the unique number of identifiers is 15 and 16 (the second 15 is removed) and then it counts "2" in the "Before" column. After the cut-off year, only 2013 has an identifier so it counts "1" in the "After" column.

If there are 2 or more identifiers in one element (for instance in row 4 and 5 where it says "30, 27" or "30, 18") then it should still be treated as identifiers separated by the comma.

df_solution <- structure(list(ID = c("1", "2", "3", "4", "5", "6"), Column1 = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Column2 = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007", 
"2014, 2011", "2007"), `Cut-Off` = c("2011", "2015", "2015", 
"2005", "2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA
), `2006` = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), `2007` = c("15", NA, "18", NA, 
"30, 18", NA), `2008` = c("16", NA, NA, "30, 27", "18, 30", NA
), `2009` = c("15", NA, NA, "20", "30, 18", NA), `2010` = c(NA, 
NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    `2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15", 
    NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA
    ), `2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), Before = c(2, 0, 2, 0, 2, 0), After = c(1, 
    0, 0, 3, 1, 1)), .Names = c("ID", "Column1", "Column2", "Cut-Off", 
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", 
"2013", "2014", "2015", "2016", "Before", "After"), row.names = c(NA, 
6L), class = "data.frame")

AntoniosK · Accepted Answer

library(tidyverse)

df %>% 
  select(-Column1, - Column2) %>%         # remove those columns
  gather(year,value,-ID, -`Cut-Off`) %>%  # reshape data
  na.omit() %>%                           # remove rows with NA
  separate_rows(value) %>%                # split values (using commas)
  group_by(ID, `Cut-Off`) %>%             # for each ID and cut-off
  summarise(Before = n_distinct(value[as.numeric(`Cut-Off`) > as.numeric(year)]),     # count distinct values where cut-off is after the dates
            After = n_distinct(value[as.numeric(`Cut-Off`) < as.numeric(year)])) %>%  # count distinct values where cut-off is before the dates
  ungroup()  %>%                     # forget the grouping
  select(-`Cut-Off`) %>%             # remove cut-off column
  right_join(df, by="ID") %>%        # join back original dataset
  mutate_at(vars(Before,After), ~coalesce(.,0L))  # replace NAs with 0 in those two columns


# # A tibble: 6 x 18
# ID    Before After Column1 Column2      `Cut-Off` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012`
#                                      
# 1 1          2     1 NA      2011         2011      NA     NA     15     16     15     NA     NA     NA    
# 2 2          0     0 NA      2015         2015      NA     NA     NA     NA     NA     NA     NA     NA    
# 3 3          2     0 NA      2015         2015      NA     NA     18     NA     NA     NA     NA     NA    
# 4 4          0     3 NA      2006, 2006,~ 2005      30     NA     NA     30, 27 20     30, 20 NA     20, 30
# 5 5          2     1 NA      2014, 2011   2011      18     NA     30, 18 18, 30 30, 18 NA     NA     NA    
# 6 6          0     1 NA      2007         2007      NA     NA     NA     NA     NA     NA     NA     26    
# # ... with 4 more variables: `2013` , `2014` , `2015` , `2016`

Making R count the number of strings inside an element

Answers (1)

Related Questions