Reputation: 11
I have a national database of all hospital records, and another national database of infection events. I am looking to extract all relevant hospital events for the infection, but I am struggling to find a way to optmise the query. OtherData
is a proxy for another set of columns.
The infection data looks like this:
UniqueID | PatientNumber | HospitalNumber | Date | OtherData |
---|---|---|---|---|
14000000 | 1234 | BAC | 2022-01-27 | DELTA |
12007927 | 5412 | HSA | 2022-01-20 | OMICRON |
1 | 7862 | UDO | 2020-02-01 | ALPHA |
The hospital data looks like this:
EpisodeID | PatientNumber | HospitalNumber | StartDate | EndDate | OtherData |
---|---|---|---|---|---|
4 | 1234 | 2022-01-25 | NA | ICU | |
987213 | 5412 | 2022-01-20 | 2022-01-27 | DIED | |
3 | BAC | 2021-11-20 | 2022-01-20 | DISCHARGED | |
3 | BAC | 2020-01-29 | 2022-02-10 | DISCHARGED |
The data can be missing lots of fields, and I have two identifiers (national and local) I can use to link the data. I query against both using UNION. But because these are National registers, and I'm dealing with Covid-19 data, we are talking about linking 10s of millions of records (on seperate servers). In order to minimise the amount of hospital data pulled in I am attempting to link between date ranges of the infection.
My query as it stands. It took 8 min pulling a single UniqueID when I tested it, and I have 10s of millions. Im not sure if I should be using and AND (X OR Y) with the dates in the INNER JOIN or if I should have them seperated, and use two more UNIONs.
This data is ingested into R for further processing and analytics.
Help appreciated!
DECLARE @days AS INT = 28;
WITH
infections AS (
SELECT UniqueID
,PatientNumber
,HospitalNumber
,Date
,OtherData
FROM infections
),
link_tbl AS (
SELECT
i.UniqueID
,h.EpisodeID
,h.PatientNumber
,h.HospitalNumber
,h.StartDate
,h.EndDate
FROM infections i
INNER JOIN hospital h
ON i.NHSNumber = h.NHSNumber
AND (h.EndDate
BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date))
AND CONVERT(date, DATEADD(DAY, @days, i.Date))
OR h.StartDate
BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date))
AND CONVERT(date, DATEADD(DAY, @days, i.Date))
)
UNION
SELECT
i.UniqueID
,h.EpisodeID
,h.NHSNumber
,h.HospitalNumber
,h.StartDate
,h.EndDate
FROM infections i
INNER JOIN hospital h
ON i.HospitalNumber = h.HospitalNumber
AND (h.EndDate
BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date))
AND CONVERT(date, DATEADD(DAY, @days, i.Date))
OR h.StartDate
BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date))
AND CONVERT(date, DATEADD(DAY, @days, i.Date))
)
)
SELECT
hospital.allothervars* ---(these are named)
,infections.* ---named in query
FROM hospital
INNER JOIN link_tbl ON hospital.EpisodeID = link_tbl.EpisodeID
Upvotes: 1
Views: 53
Reputation: 104
I deal with data on a similar scale with similar use cases. One thing I found is that filtering by an unindexed date is VERY slow. We got around it by finding the index for a specific date and using that as a subquery.
So as an example :
SELECT Id,
Date,
Value,
FROM Table
WHERE Id >= (SELECT MIN(Id) FROM Table WHERE Date = '01-01-1970')
rather than doing
SELECT Id,
Date,
Value,
FROM Table
WHERE Date >= '01-01-1970'
Indexing datetime data creates a pretty big file size, so most of the time it's not indexed. The SQL Execution engine can do a much better job sorting on primary keys rather than datetimes.
What works for me is generating a big list of the min and max IDs for a date range grouped by the day. Then whenever I need to do an ad hoc request or do something that has a bound StartDate but not an EndDate, you can just hard code the ID and do a greater than operator for your filter.
Upvotes: 2