Match columns with several conditions efficiently

Question

I have a big dataframe in which I have two columns containing dates. For each individual (id), date.x contains a unique date but date.y contains several dates. I would like to match the dates according to several conditions, so that I have one row per individual with one date.x and one date.y. By order of preference:

if there is date.y so that date.x = date.y, then I want to keep only this date.y.
if there is date.y so that date.x > date.y, then I want to keep only the closest value of date.y that is smaller than date.x.
if there isn't date.y so that date.x > date.y (i.e if all date.y are greater than date.x), then I want to keep only the closest value of date.y that is greater than date.x.

Below is a reproducible example:

library(tibble)

test <- tibble(
  group = rep(c("A", "B"), each = 6),
  id = rep(rep(1:2, each = 3), 2),
  date.x = rep(c(2002, 2008), each = 6),
  date.y = c(1998, 2000, 2004, 1997, 1998, 1999, 2006, 2008, 2010, 2009, 2010, 2011)
)

# Current form:
# A tibble: 12 x 4
   group    id date.x date.y
        
 1 A         1   2002   1998
 2 A         1   2002   2000
 3 A         1   2002   2004
 4 A         2   2002   1997
 5 A         2   2002   1998
 6 A         2   2002   1999
 7 B         1   2008   2006
 8 B         1   2008   2008
 9 B         1   2008   2010
10 B         2   2008   2009
11 B         2   2008   2010
12 B         2   2008   2011

# expected output:

# A tibble: 4 x 4
  group    id date.x date.y
       
1 A         1   2002   2000
2 A         2   2002   1999
3 B         1   2008   2008
4 B         2   2008   2009

Also, since my dataframe is quite big and makes my RStudio session crash sometimes, I need a solution which is the most economical in memory.

How can I do that?

Junitar · Accepted Answer

Here is a possibility.

library(dplyr)

test %>% 
  group_by(id, date.x) %>% 
  filter(
    date.x == date.y |
    date.x - date.y == min(abs(date.x - date.y)) & date.x > date.y |
    date.x + date.y == min(date.x + date.y) & date.x < date.y
  )

# A tibble: 4 x 4
# Groups:   id, date.x [4]
  group    id date.x date.y
       
1 A         1   2002   2000
2 A         2   2002   1999
3 B         1   2008   2008
4 B         2   2008   2009

Match columns with several conditions efficiently

Answers (2)

Related Questions