Reputation: 8557
I have a big dataframe in which I have two columns containing dates. For each individual (id
), date.x
contains a unique date but date.y
contains several dates. I would like to match the dates according to several conditions, so that I have one row per individual with one date.x
and one date.y
. By order of preference:
if there is date.y
so that date.x = date.y
, then I want to keep only this date.y
.
if there is date.y
so that date.x > date.y
, then I want to keep only the closest value of date.y
that is smaller than date.x
.
if there isn't date.y
so that date.x > date.y
(i.e if all date.y
are greater than date.x
), then I want to keep only the closest value of date.y
that is greater than date.x
.
Below is a reproducible example:
library(tibble)
test <- tibble(
group = rep(c("A", "B"), each = 6),
id = rep(rep(1:2, each = 3), 2),
date.x = rep(c(2002, 2008), each = 6),
date.y = c(1998, 2000, 2004, 1997, 1998, 1999, 2006, 2008, 2010, 2009, 2010, 2011)
)
# Current form:
# A tibble: 12 x 4
group id date.x date.y
<chr> <int> <dbl> <dbl>
1 A 1 2002 1998
2 A 1 2002 2000
3 A 1 2002 2004
4 A 2 2002 1997
5 A 2 2002 1998
6 A 2 2002 1999
7 B 1 2008 2006
8 B 1 2008 2008
9 B 1 2008 2010
10 B 2 2008 2009
11 B 2 2008 2010
12 B 2 2008 2011
# expected output:
# A tibble: 4 x 4
group id date.x date.y
<chr> <int> <dbl> <dbl>
1 A 1 2002 2000
2 A 2 2002 1999
3 B 1 2008 2008
4 B 2 2008 2009
Also, since my dataframe is quite big and makes my RStudio session crash sometimes, I need a solution which is the most economical in memory.
How can I do that?
Upvotes: 1
Views: 68
Reputation: 999
Here is a possibility.
library(dplyr)
test %>%
group_by(id, date.x) %>%
filter(
date.x == date.y |
date.x - date.y == min(abs(date.x - date.y)) & date.x > date.y |
date.x + date.y == min(date.x + date.y) & date.x < date.y
)
# A tibble: 4 x 4
# Groups: id, date.x [4]
group id date.x date.y
<chr> <int> <dbl> <dbl>
1 A 1 2002 2000
2 A 2 2002 1999
3 B 1 2008 2008
4 B 2 2008 2009
Upvotes: 1
Reputation: 24790
Here's a version using case_when
:
library(dplyr)
test %>%
group_by(group,id,date.x) %>%
summarise(date.y = case_when(unique(date.x) %in% date.y ~ unique(date.x),
any(date.y < unique(date.x)) ~ max(date.y[date.y < unique(date.x)]),
all(date.y > unique(date.x)) ~ min(date.y)))
## A tibble: 4 x 4
## Groups: group, id [4]
# group id date.x date.y
# <chr> <int> <dbl> <dbl>
#1 A 1 2002 2000
#2 A 2 2002 1999
#3 B 1 2008 2008
#4 B 2 2008 2009
Upvotes: 1