Reputation: 587
I am trying to do the following. I have a dataset1 that goes from 2015-01-31 up until 2021-06-30:
dataset1_dates=c("2015-01-31","2015-02-28","2015-03-31","2015-04-30","2015-05-31","2015-06-30","2015-07-31","2015-08-31","2015-09-30","2015-10-31","2015-11-30","2015-12-31","2016-01-31","2016-02-29","2016-03-31","2016-04-30","2016-05-31","2016-06-30","2016-07-31","2016-08-31","2016-09-30","2016-10-31","2016-11-30","2016-12-31","2017-01-31","2017-02-28","2017-03-31","2017-04-30","2017-05-31","2017-06-30","2017-07-31","2017-08-31","2017-09-30","2017-10-31","2017-11-30","2017-12-31","2018-01-31","2018-02-28","2018-03-31","2018-04-30","2018-05-31","2018-06-30","2018-07-31","2018-08-31","2018-09-30","2018-10-31","2018-11-30","2018-12-31","2019-01-31","2019-02-28","2019-03-31","2019-04-30","2019-05-31","2019-06-30","2019-07-31","2019-08-31","2019-09-30","2019-10-31","2019-11-30","2019-12-31","2020-01-31","2020-02-29","2020-03-31","2020-04-30","2020-05-31","2020-06-30","2020-07-31","2020-08-31","2020-09-30","2020-10-31","2020-11-30","2020-12-31","2021-01-31","2021-02-28","2021-03-31","2021-04-30","2021-05-31","2021-06-30")
# add dates
dataset1 <- expand.grid(Organisation = c("A123","B234","C456"),
Date = dataset1_dates)
## sort
dataset1 <- dataset1[order(dataset1$Organisation, dataset1$Date),]
## reset id
rownames(dataset1) <- NULL
dataset1$Organisation <- as.character(dataset1$Organisation)
dataset1$Date <- as.Date(dataset1$Date, format="%Y-%m-%d")
Then I have a dataset2 that tells me at specific points in time the performance of each organisation at the time of inspection:
dataset2 <- read.table(
text = "
Organisation Date_inspection Performance
A123 2015-01-31 Good
A123 2016-01-14 OK
B234 2017-06-14 Inadequate
C456 2015-06-30 OK
C456 2016-02-10 Inspected but not rated
C456 2018-05-18 Good
C456 2020-03-21 OK",
header = TRUE)
dataset2$Organisation <- as.character(dataset2$Organisation)
dataset2$Date_inspection <- as.Date(dataset2$Date_inspection, format="%Y-%m-%d")
dataset2$Performance <- as.character(dataset2$Performance)
I would like to assign to each month after inspection, including the month of the inspection, the performance category of the organisation.
I would also like to consider months prior to the 1st inspection as equal to the performance category at date of 1st inspection.
Expected result:
Date | Organisation | Performance |
2015-01-31 | A123 | Good |
2015-02-28 | A123 | Good |
2015-03-31 | A123 | Good |
...
2016-01-31 | A123 | OK |
...
2021-06-30 | A123 | OK |
2015-01-31 | B234 | Inadequate |
2015-02-28 | B234 | Inadequate |
2015-03-31 | B234 | Inadequate |
...
2021-06-30 | B234 | Inadequate |
2015-01-31 | C456 | OK |
2015-02-28 | C456 | OK |
2015-03-31 | C456 | OK |
...
2016-02-29 | C456 | OK |
...
2018-05-31 | C456 | Good |
2018-06-30 | C456 | Good |
...
2020-03-31 | C456 | OK |
...
2021-06-30 | C456 | OK |
Any ideas on how to do this in R?
Upvotes: 0
Views: 245
Reputation: 5798
Notes on how to remedy your broken read.table()
expression:
Currently the whitespace in the values of your Performance column is causing an error when parsing it into the table. An easy remedy is to recode your values prior to importing it as follows (note that the whitespace in the string "Inspected but not rated", has been replaced with "_", resulting the value "Inspected_but_not_rated").
dataset2 <- read.table(
text = "
Organisation Date_inspection Performance
A123 2015-01-31 Good
A123 2016-01-14 OK
B234 2017-06-14 Inadequate
C456 2015-06-30 OK
C456 2016-02-10 Inspected_but_not_rated
C456 2018-05-18 Good
C456 2020-03-21 OK",
header = TRUE)
We can now insert the whitespace back into the string as follows:
dataset2$Performance <- with(
dataset2,
gsub("_", " ", Performance)
)
Resulting in the dataset2
object you see below.
Base R (amended) solution (in line with additional requests in comments):
Firstly your dataset2
data.frame
object is now broken so we will start using:
dataset2 <- structure(list(Organisation = c("A123", "A123", "B234", "C456",
"C456", "C456", "C456"), Date_inspection = structure(c(16466,
16814, 17331, 16616, 16841, 17669, 18342), class = "Date"), Performance = c("Good",
"OK", "Inadequate", "OK", "Inspected but not rated", "Good",
"OK")), row.names = c(NA, -7L), class = "data.frame")
Secondly, all we have to do in this amended case is re-code "Inspected but not rated" to NA_character_
. Please see the revised solution below:
# Recode Inspected but not rated to an NA of type
# character: clean_df2 => data.frame
clean_df2 <- transform(
dataset2,
Performance = gsub(
"Inspected but not rated",
NA_character_,
Performance
)
)
# Expand the "dataset2" to months which the ratings
# are considered applicable over:
# inspectionsApplicable => data.frame
inspectionsApplicable <- unique(
data.frame(
do.call(
rbind,
lapply(
with(
clean_df2,
split(
clean_df2,
Organisation
)
),
function(x){
x$Month_inspected <- as.Date(
strftime(
x$Date_inspection,
"%Y-%m-01"
)
)
x$MinMonthInData <- as.Date(
strftime(
min(
dataset1$Date[
match(
x$Organisation,
dataset1$Organisation
)
]
),
"%Y-%m-01"
)
)
data.frame(
Organisation = c(
x$Organisation[1],
x$Organisation
),
Months = c(
as.Date(unique(x$MinMonthInData)),
as.Date(x$Month_inspected, "%Y-%m-%d")
),
Performance = c(
x$Performance[
which.max(
!(
is.na(
x$Performance
)
)
)
],
x$Performance
)
)
}
)
),
row.names = NULL
)
)
# Left join the tables, and forward fill,
# the inspection category: ir_res => data.frame
res <- within(
merge(
transform(
with(
dataset1,
dataset1[order(Organisation, Date),]
),
Months = as.Date(
strftime(
Date,
"%Y-%m-01"
)
)
),
inspectionsApplicable,
by = c(
"Organisation",
"Months"
),
all.x = TRUE
),
{
Performance <- na.omit(
Performance
)[
cumsum(
!(
is.na(
Performance
)
)
)
]
rm(Months)
}
)
Base R (Original) solution:
# Expand the "dataset2" to months which the ratings
# are considered applicable over:
# inspectionsApplicable => data.frame
inspectionsApplicable <- unique(
data.frame(
do.call(
rbind,
lapply(
with(
dataset2,
split(
dataset2,
Organisation
)
),
function(x){
x$Month_inspected <- as.POSIXlt(
strftime(
x$Date_inspection,
"%Y-%m-01"
)
)
x$MinMonthInData <- as.Date(
strftime(
min(
dataset1$Date[
match(
x$Organisation,
dataset1$Organisation
)
]
),
"%Y-%m-01"
)
)
data.frame(
Organisation = c(
x$Organisation[1],
x$Organisation
),
Months = c(
as.Date(unique(x$MinMonthInData)),
as.Date(x$Month_inspected, "%Y-%m-%d")
),
Performance = c(
x$Performance[1],
x$Performance
)
)
}
)
),
row.names = NULL
)
)
# Left join the tables, and forward fill,
# the inspection category: ir_res => data.frame
res <- transform(
merge(
transform(
with(
dataset1,
dataset1[order(Date),]
),
Months = as.Date(
strftime(
Date,
"%Y-%m-01"
)
)
),
inspectionsApplicable,
by = c(
"Organisation",
"Months"
),
all.x = TRUE
),
Performance = na.omit(
Performance
)[
cumsum(
!(
is.na(
Performance
)
)
)
]
)
Upvotes: 2
Reputation: 9678
Here's an approach using dplyr. Note that this requires Organisation
to be character
in both datasets (i.e. no conversion using as.factor
).
lookup <- function(x, y) {
dataset2 %>%
filter(Organisation == x, Date_inspection <= y) %>%
pull(Performance) %>%
last(
default = dataset2 %>%
filter(Organisation == x) %>%
slice_min(Date_inspection) %>%
pull(Performance)
)
}
# add `Performance` by applying `lookup` over `organisation` and `Date`
dataset1 %>%
mutate(Performance = map2_chr(Organisation, Date, lookup))
The idea is to use a function lookup
that pull
s the Performance
value for the organisation's last inspection. If that value doesn't exist (because there is no Date_inspection <= y
) we use the first inspection date for that organisation.
Upvotes: 3
Reputation: 627
You are on the right way, but neither your dataset1
nor dataset2
contain the column Location
.
If this column is missing in your main data too, then this may be the first step to look into.
I can edit my answer if needed.
Upvotes: 0