bretauv
bretauv

Reputation: 8557

Match columns with several conditions efficiently

I have a big dataframe in which I have two columns containing dates. For each individual (id), date.x contains a unique date but date.y contains several dates. I would like to match the dates according to several conditions, so that I have one row per individual with one date.x and one date.y. By order of preference:

Below is a reproducible example:

library(tibble)

test <- tibble(
  group = rep(c("A", "B"), each = 6),
  id = rep(rep(1:2, each = 3), 2),
  date.x = rep(c(2002, 2008), each = 6),
  date.y = c(1998, 2000, 2004, 1997, 1998, 1999, 2006, 2008, 2010, 2009, 2010, 2011)
)

# Current form:
# A tibble: 12 x 4
   group    id date.x date.y
   <chr> <int>  <dbl>  <dbl>
 1 A         1   2002   1998
 2 A         1   2002   2000
 3 A         1   2002   2004
 4 A         2   2002   1997
 5 A         2   2002   1998
 6 A         2   2002   1999
 7 B         1   2008   2006
 8 B         1   2008   2008
 9 B         1   2008   2010
10 B         2   2008   2009
11 B         2   2008   2010
12 B         2   2008   2011
# expected output:

# A tibble: 4 x 4
  group    id date.x date.y
  <chr> <int>  <dbl>  <dbl>
1 A         1   2002   2000
2 A         2   2002   1999
3 B         1   2008   2008
4 B         2   2008   2009

Also, since my dataframe is quite big and makes my RStudio session crash sometimes, I need a solution which is the most economical in memory.

How can I do that?

Upvotes: 1

Views: 68

Answers (2)

Junitar
Junitar

Reputation: 999

Here is a possibility.

library(dplyr)

test %>% 
  group_by(id, date.x) %>% 
  filter(
    date.x == date.y |
    date.x - date.y == min(abs(date.x - date.y)) & date.x > date.y |
    date.x + date.y == min(date.x + date.y) & date.x < date.y
  )

# A tibble: 4 x 4
# Groups:   id, date.x [4]
  group    id date.x date.y
  <chr> <int>  <dbl>  <dbl>
1 A         1   2002   2000
2 A         2   2002   1999
3 B         1   2008   2008
4 B         2   2008   2009

Upvotes: 1

Ian Campbell
Ian Campbell

Reputation: 24790

Here's a version using case_when:

library(dplyr)
test %>%
  group_by(group,id,date.x) %>%
  summarise(date.y = case_when(unique(date.x) %in% date.y ~ unique(date.x), 
                               any(date.y < unique(date.x)) ~ max(date.y[date.y < unique(date.x)]),
                               all(date.y > unique(date.x)) ~ min(date.y)))
## A tibble: 4 x 4
## Groups:   group, id [4]
#  group    id date.x date.y
#  <chr> <int>  <dbl>  <dbl>
#1 A         1   2002   2000
#2 A         2   2002   1999
#3 B         1   2008   2008
#4 B         2   2008   2009

Upvotes: 1

Related Questions