Alex Bhattacharya
Alex Bhattacharya

Reputation: 11

Combining WITH, UNION and OR in JOIN

I have a national database of all hospital records, and another national database of infection events. I am looking to extract all relevant hospital events for the infection, but I am struggling to find a way to optmise the query. OtherData is a proxy for another set of columns.

The infection data looks like this:

UniqueID PatientNumber HospitalNumber Date OtherData
14000000 1234 BAC 2022-01-27 DELTA
12007927 5412 HSA 2022-01-20 OMICRON
1 7862 UDO 2020-02-01 ALPHA

The hospital data looks like this:

EpisodeID PatientNumber HospitalNumber StartDate EndDate OtherData
4 1234 2022-01-25 NA ICU
987213 5412 2022-01-20 2022-01-27 DIED
3 BAC 2021-11-20 2022-01-20 DISCHARGED
3 BAC 2020-01-29 2022-02-10 DISCHARGED

The data can be missing lots of fields, and I have two identifiers (national and local) I can use to link the data. I query against both using UNION. But because these are National registers, and I'm dealing with Covid-19 data, we are talking about linking 10s of millions of records (on seperate servers). In order to minimise the amount of hospital data pulled in I am attempting to link between date ranges of the infection.

My query as it stands. It took 8 min pulling a single UniqueID when I tested it, and I have 10s of millions. Im not sure if I should be using and AND (X OR Y) with the dates in the INNER JOIN or if I should have them seperated, and use two more UNIONs.

This data is ingested into R for further processing and analytics.

Help appreciated!

DECLARE @days AS INT = 28;

WITH 
infections AS (
SELECT  UniqueID
  ,PatientNumber
  ,HospitalNumber
  ,Date
  ,OtherData
FROM infections
),
link_tbl AS (
SELECT
  i.UniqueID
  ,h.EpisodeID
  ,h.PatientNumber
  ,h.HospitalNumber
  ,h.StartDate
  ,h.EndDate
FROM infections i
INNER JOIN hospital h
  ON i.NHSNumber = h.NHSNumber
  AND (h.EndDate
    BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date)) 
    AND CONVERT(date, DATEADD(DAY, @days, i.Date))
  OR h.StartDate
    BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date)) 
    AND CONVERT(date, DATEADD(DAY, @days, i.Date))
  )
UNION

SELECT
  i.UniqueID
  ,h.EpisodeID
  ,h.NHSNumber
  ,h.HospitalNumber
  ,h.StartDate
  ,h.EndDate
FROM infections i
INNER JOIN hospital h
  ON i.HospitalNumber = h.HospitalNumber
  AND (h.EndDate
    BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date)) 
    AND CONVERT(date, DATEADD(DAY, @days, i.Date))
  OR h.StartDate
    BETWEEN CONVERT(date, DATEADD(DAY, -@days, i.Date)) 
    AND CONVERT(date, DATEADD(DAY, @days, i.Date))
  )
)

SELECT 
  hospital.allothervars* ---(these are named)
  ,infections.* ---named in query
FROM hospital
INNER JOIN link_tbl ON hospital.EpisodeID = link_tbl.EpisodeID

Upvotes: 1

Views: 53

Answers (1)

Parker.V
Parker.V

Reputation: 104

I deal with data on a similar scale with similar use cases. One thing I found is that filtering by an unindexed date is VERY slow. We got around it by finding the index for a specific date and using that as a subquery.

So as an example :

SELECT Id,
       Date,
       Value, 
FROM Table
WHERE Id >= (SELECT MIN(Id) FROM Table WHERE Date = '01-01-1970')

rather than doing

SELECT Id,
       Date,
       Value, 
FROM   Table
WHERE Date >= '01-01-1970'

Indexing datetime data creates a pretty big file size, so most of the time it's not indexed. The SQL Execution engine can do a much better job sorting on primary keys rather than datetimes.

What works for me is generating a big list of the min and max IDs for a date range grouped by the day. Then whenever I need to do an ad hoc request or do something that has a bound StartDate but not an EndDate, you can just hard code the ID and do a greater than operator for your filter.

Upvotes: 2

Related Questions