Reputation: 1
For each ID in one table (Table 1), how can I count the number of rows in another table (Table 2) which fall between a start date and end date (two columns in the first table) in R?
Table 1:
contact_number | start_date | end_date |
---|---|---|
123 | 01/01/2023 | 01/01/2024 |
1234 | 01/01/2022 | 01/01/2023 |
Table 2:
contact_number | visit_type | date |
---|---|---|
123 | 1 | 02/01/2023 |
123 | 2 | 05/01/2023 |
1234 | 2 | 05/01/2022 |
Output wanted:
contact_number | start_date | end_date | visits |
---|---|---|---|
123 | 01/01/2023 | 01/01/2024 | 2 |
1234 | 01/01/2022 | 01/01/2023 | 1 |
I am wanting to replicate a Power BI DAX query:
visits_per_membership =
CALCULATE(
COUNTROWS('Table 2'),
FILTER(
'Table 2',
'Table 2'[contact_number] = Table 1[contact_number] && 'Table 2'[EntryDate].[Date] >= Table 1[start_date].[Date] && 'Table 2'[EntryDate].[Date] < Table 1[renewal_date].[Date])
)
Upvotes: 0
Views: 321
Reputation: 17204
Base R solution, mapping over rows of table1
and comparing to table2
by id and dates:
table1$visits <- mapply(
\(contact, start, end) sum(
table2$contact_number == contact &
table2$date >= start &
table2$date <= end
),
table1$contact_number,
table1$start_date,
table1$end_date
)
table1
contact_number start_date end_date visits
1 123 2023-01-01 2024-01-01 2
2 1234 2022-01-01 2023-01-01 1
Or using dplyr::left_join()
:
library(dplyr) # >= v1.1.0:
table1 %>%
left_join(table2, multiple = "all") %>%
summarize(
visits = sum(between(date, start_date, end_date)),
.by = c(contact_number, start_date, end_date)
)
Upvotes: 0
Reputation: 27762
data.table
approach
dt1[, visits := dt2[dt1, on = .(date >= start_date, date <= end_date)][, .N, by = contact_number]$N]
contact_number start_date end_date visits
1: 123 2023-01-01 2024-01-01 2
2: 1234 2022-01-01 2023-01-01 1
sample data used
dt1 <- fread("contact_number start_date end_date
123 01/01/2023 01/01/2024
1234 01/01/2022 01/01/2023")
dt2 <- fread("contact_number visit_type date
123 1 02/01/2023
123 2 05/01/2023
1234 2 05/01/2022")
dt1[, c("start_date", "end_date") := lapply(.SD, as.Date, format = "%d/%m/%Y"),
.SDcols = c("start_date", "end_date")]
dt2[, date := as.Date(date, format = "%d/%m/%Y")]
Upvotes: 0