Kirk Fogg
Kirk Fogg

Reputation: 531

Filtering dataset on closest date for multiple variables

I'm working with data where each subject has several variables, and a date on which each variable was measured. The variables are measured multiple times for each subject.

I have a separate file with, say, "target dates" for each subject. I want to filter the original dataset and include the only the variable measurements closest to the target date for each subject.

Here's a simple example:

library(dplyr)

set.seed(1234)

data <- data.frame(dates = sample(seq(as.Date('1999/01/01'), 
as.Date('2010/01/01'), by="day"), 200), 
                   ids = rep(1:50, 2))
data <- arrange(data, ids)
data$vars <- rep(1:2, 100)

target.date <- data.frame(dates = sample(seq(as.Date('1999/01/01'), 
as.Date('2010/01/01'), by="day"), 50), 
                          ids = rep(1:50))

For each subject, there are two variables each measured twice. I want to include only the measurement of each variable closest to the date in target.date.

I tried using dplyr like this:

data.sub <- data %>%
  group_by(id, vars) %>%

but I'm not sure how to proceed, since I somehow have to reference the date in target.date associated with each of the ids. I'd also like to choose the closest date without going over the target date, but may not be possible for some subjects in this example I created.

Here is a snippet of the output I'm trying to get at:

       dates  ids  vars
1 2000-04-01    1     1
2 2003-08-26    1     2
3 2005-01-22    2     1
4 2002-05-13    2     2
.
.
.

For subject 2, both dates for vars==1 are after the target date. But I included it in anyway for simplicity. But if possible, I'd like to take the dates closest to the target date without going over it.

Upvotes: 0

Views: 873

Answers (1)

Gopala
Gopala

Reputation: 10483

Here is what I think you need. It is not very clear from your question and you did not provide expected output given the sample data. I can modify based on your feedback.

data <- inner_join(data, target.date, by = 'ids')
data$days <- difftime(data$dates.x, data$dates.y, units = 'days')
data$days <- abs(as.integer(data$days))
data %>%
  group_by(ids, vars) %>%
  filter(days == min(days)) %>%
  slice(1)

Upvotes: 1

Related Questions