Reputation: 587

How to merge two datasets according to DATE in R - with code

I am trying to do the following. I have a dataset1 that goes from 2015-01-31 up until 2021-06-30:

dataset1_dates=c("2015-01-31","2015-02-28","2015-03-31","2015-04-30","2015-05-31","2015-06-30","2015-07-31","2015-08-31","2015-09-30","2015-10-31","2015-11-30","2015-12-31","2016-01-31","2016-02-29","2016-03-31","2016-04-30","2016-05-31","2016-06-30","2016-07-31","2016-08-31","2016-09-30","2016-10-31","2016-11-30","2016-12-31","2017-01-31","2017-02-28","2017-03-31","2017-04-30","2017-05-31","2017-06-30","2017-07-31","2017-08-31","2017-09-30","2017-10-31","2017-11-30","2017-12-31","2018-01-31","2018-02-28","2018-03-31","2018-04-30","2018-05-31","2018-06-30","2018-07-31","2018-08-31","2018-09-30","2018-10-31","2018-11-30","2018-12-31","2019-01-31","2019-02-28","2019-03-31","2019-04-30","2019-05-31","2019-06-30","2019-07-31","2019-08-31","2019-09-30","2019-10-31","2019-11-30","2019-12-31","2020-01-31","2020-02-29","2020-03-31","2020-04-30","2020-05-31","2020-06-30","2020-07-31","2020-08-31","2020-09-30","2020-10-31","2020-11-30","2020-12-31","2021-01-31","2021-02-28","2021-03-31","2021-04-30","2021-05-31","2021-06-30")
# add dates
dataset1 <- expand.grid(Organisation = c("A123","B234","C456"),
                       Date = dataset1_dates)
  
## sort
dataset1 <- dataset1[order(dataset1$Organisation, dataset1$Date),]
## reset id
rownames(dataset1) <- NULL

dataset1$Organisation <- as.character(dataset1$Organisation)
dataset1$Date <- as.Date(dataset1$Date, format="%Y-%m-%d")

Then I have a dataset2 that tells me at specific points in time the performance of each organisation at the time of inspection:

dataset2 <- read.table(
  text = "
Organisation    Date_inspection     Performance
A123            2015-01-31          Good
A123            2016-01-14          OK
B234            2017-06-14          Inadequate
C456            2015-06-30          OK
C456            2016-02-10          Inspected but not rated
C456            2018-05-18          Good
C456            2020-03-21          OK",
  header = TRUE)

dataset2$Organisation <- as.character(dataset2$Organisation)
dataset2$Date_inspection <- as.Date(dataset2$Date_inspection, format="%Y-%m-%d")
dataset2$Performance <- as.character(dataset2$Performance)

I would like to assign to each month after inspection, including the month of the inspection, the performance category of the organisation.

I would also like to consider months prior to the 1st inspection as equal to the performance category at date of 1st inspection.

Expected result:

Date        |   Organisation    |     Performance     |
2015-01-31  |   A123            |     Good            |
2015-02-28  |   A123            |     Good            |
2015-03-31  |   A123            |     Good            |
...
2016-01-31  |   A123            |     OK              |
...
2021-06-30  |   A123            |     OK              |
2015-01-31  |   B234            |     Inadequate      |
2015-02-28  |   B234            |     Inadequate      |
2015-03-31  |   B234            |     Inadequate      |
...
2021-06-30  |   B234            |     Inadequate      |
2015-01-31  |   C456            |     OK              |
2015-02-28  |   C456            |     OK              |
2015-03-31  |   C456            |     OK              |
...
2016-02-29  |   C456            |     OK              |
...
2018-05-31  |   C456            |     Good            |
2018-06-30  |   C456            |     Good            |
...
2020-03-31  |   C456            |     OK              |
...
2021-06-30  |   C456            |     OK              |

Any ideas on how to do this in R?

Upvotes: 0

Answers (3)

hello_friend

Reputation: 5798

Notes on how to remedy your broken read.table() expression: Currently the whitespace in the values of your Performance column is causing an error when parsing it into the table. An easy remedy is to recode your values prior to importing it as follows (note that the whitespace in the string "Inspected but not rated", has been replaced with "_", resulting the value "Inspected_but_not_rated").

dataset2 <- read.table(
  text = "
Organisation    Date_inspection     Performance
A123            2015-01-31          Good
A123            2016-01-14          OK
B234            2017-06-14          Inadequate
C456            2015-06-30          OK
C456            2016-02-10          Inspected_but_not_rated
C456            2018-05-18          Good
C456            2020-03-21          OK",
  header = TRUE)

We can now insert the whitespace back into the string as follows:

dataset2$Performance <- with(
  dataset2, 
  gsub("_", " ", Performance)
)

Resulting in the dataset2 object you see below.

Base R (amended) solution (in line with additional requests in comments): Firstly your dataset2 data.frame object is now broken so we will start using:

dataset2 <- structure(list(Organisation = c("A123", "A123", "B234", "C456", 
"C456", "C456", "C456"), Date_inspection = structure(c(16466, 
16814, 17331, 16616, 16841, 17669, 18342), class = "Date"), Performance = c("Good", 
"OK", "Inadequate", "OK", "Inspected but not rated", "Good", 
"OK")), row.names = c(NA, -7L), class = "data.frame")

Secondly, all we have to do in this amended case is re-code "Inspected but not rated" to NA_character_. Please see the revised solution below:

# Recode Inspected but not rated to an NA of type 
# character: clean_df2 => data.frame
clean_df2 <- transform(
  dataset2,
  Performance = gsub(
    "Inspected but not rated",
    NA_character_,
    Performance
    )
)

# Expand the "dataset2" to months which the ratings
# are considered applicable over: 
# inspectionsApplicable => data.frame
inspectionsApplicable <- unique(
  data.frame(
    do.call(
      rbind, 
      lapply(
        with(
          clean_df2,
          split(
            clean_df2, 
            Organisation
          )
        ),
        function(x){
          x$Month_inspected <- as.Date(
            strftime(
              x$Date_inspection,
              "%Y-%m-01"
            )
          )
          x$MinMonthInData <- as.Date(
            strftime(
              min(
                dataset1$Date[
                  match(
                    x$Organisation, 
                    dataset1$Organisation
                  )
                ]
              ), 
              "%Y-%m-01"
            )
          )
          data.frame(
            Organisation = c(
              x$Organisation[1],
              x$Organisation
            ),
            Months = c(
              as.Date(unique(x$MinMonthInData)),
              as.Date(x$Month_inspected, "%Y-%m-%d")
            ),
            Performance = c(
              x$Performance[
                which.max(
                  !(
                    is.na(
                      x$Performance
                      )
                    )
                  )
                ],
              x$Performance
            )
          )
        }
      )
    ),
    row.names = NULL
  )
)

# Left join the tables, and forward fill, 
# the inspection category: ir_res => data.frame
res <- within(
  merge(
    transform(
      with(
        dataset1,
        dataset1[order(Organisation, Date),]
      ), 
      Months = as.Date(
        strftime(
          Date,
          "%Y-%m-01"
        )
      )
    ),
    inspectionsApplicable,
    by = c(
      "Organisation",
      "Months"
    ),
    all.x = TRUE
  ),
  {
    Performance <- na.omit(
      Performance
    )[
      cumsum(
        !(
          is.na(
            Performance
          )
        )
      )
    ]
    rm(Months)
  }
)

Base R (Original) solution:

# Expand the "dataset2" to months which the ratings
# are considered applicable over: 
# inspectionsApplicable => data.frame
inspectionsApplicable <- unique(
  data.frame(
    do.call(
      rbind, 
      lapply(
        with(
          dataset2, 
          split(
            dataset2, 
            Organisation
          )
        ),
        function(x){
          x$Month_inspected <- as.POSIXlt(
            strftime(
              x$Date_inspection,
              "%Y-%m-01"
            )
          )
          x$MinMonthInData <- as.Date(
            strftime(
              min(
                dataset1$Date[
                  match(
                    x$Organisation, 
                    dataset1$Organisation
                    )
                  ]
                ), 
              "%Y-%m-01"
              )
            )
          data.frame(
            Organisation = c(
              x$Organisation[1],
              x$Organisation
            ),
            Months = c(
              as.Date(unique(x$MinMonthInData)),
              as.Date(x$Month_inspected, "%Y-%m-%d")
            ),
           Performance = c(
             x$Performance[1],
             x$Performance
            )
          )
        }
      )
    ),
    row.names = NULL
  )
)

# Left join the tables, and forward fill, 
# the inspection category: ir_res => data.frame
res <- transform(
  merge(
    transform(
      with(
        dataset1,
        dataset1[order(Date),]
      ), 
      Months = as.Date(
        strftime(
          Date,
          "%Y-%m-01"
        )
      )
    ),
    inspectionsApplicable,
    by = c(
      "Organisation",
      "Months"
    ),
    all.x = TRUE
  ),
  Performance = na.omit(
    Performance
    )[
      cumsum(
        !(
          is.na(
            Performance
            )
          )
        )
      ]
)

Upvotes: 2

Martin C. Arnold

Reputation: 9678

Here's an approach using dplyr. Note that this requires Organisation to be character in both datasets (i.e. no conversion using as.factor).

lookup <- function(x, y) {
  dataset2 %>% 
  filter(Organisation == x, Date_inspection <= y) %>% 
  pull(Performance) %>%
  last(
    default = dataset2 %>% 
    filter(Organisation == x) %>% 
    slice_min(Date_inspection) %>% 
    pull(Performance)
  ) 
}

# add `Performance` by applying `lookup` over `organisation` and `Date`
dataset1 %>%
  mutate(Performance = map2_chr(Organisation, Date, lookup))

The idea is to use a function lookup that pulls the Performance value for the organisation's last inspection. If that value doesn't exist (because there is no Date_inspection <= y) we use the first inspection date for that organisation.

Upvotes: 3

Capt.Krusty

Reputation: 627

You are on the right way, but neither your dataset1 nor dataset2 contain the column Location.

If this column is missing in your main data too, then this may be the first step to look into.

I can edit my answer if needed.

Upvotes: 0

How to merge two datasets according to DATE in R - with code

Answers (3)

Related Questions