mslhrt
mslhrt

Reputation: 207

SQL Duplicate Rows Multiple Joins

I'm pretty much a noob when it comes to SQL so any help would be appreciated. I have a large data set that I am filtering through for a hospital. I am pulling data from 6 different tables and one of my tables has duplicate rows for each visit. I only want to pull in one row for each visit (it doesn't matter which row is pulled in). I know I need to use a DISTINCT, or GROUP BY clause but my syntax must be wrong.

    SELECT
         ADV.[VisitID]          AS  VisitID
        ,ADV.[Name]             AS  Name
        ,ADV.[UnitNumber]       AS  UnitNumber
        ,CONVERT(DATE,ADV.[BirthDateTime])                          AS  BirthDate
        ,ADV.[ReasonForVisit]   AS  ReasonForVisit
        ,ADV.[AccountNumber]    AS  AccountNumber
        ,DATEDIFF(day, ADV.ServiceDateTime, DIS.DischargeDateTime)  AS LOS
        ,ADV.[HomePhone]        AS  PhoneNumber
        ,ADV.[ServiceDateTime]  AS  ServiceDateTime
        ,ADV.[Status]           AS  'Status'
        ,PRV.[PrimaryCareID]    AS  PCP
        ,LAB.[TestMnemonic]     AS  Test
        ,LAB.[ResultRW]         AS  Result
        ,LAB.[AbnormalFlag]     AS  AbnormalFlag
        ,LAB.[ResultDateTime]   AS  ResultDateTime
        ,DIS.[Diagnosis]        AS  DischargeDiagnosis
        ,DIS.[ErDiagnosis]      AS  ERDiagnosis
        ,DCP.[TextLine]         AS  ProblemList


FROM          Visits                                      ADV
    LEFT JOIN Tests                                       LAB             ON ( LAB.VisitID  = ADV.VisitID AND
                                                                               LAB.SourceID = ADV.SourceID )
    LEFT JOIN Discharge                                   DIS             ON ( DIS.VisitID  = LAB.VisitID AND
                                                                               DIS.SourceID = LAB.SourceID )
    LEFT JOIN Providers                                   PRV             ON ( PRV.VisitID  = DIS.VisitID AND
                                                                               PRV.SourceID = DIS.SourceID )
    LEFT JOIN ProblemListVisits                           EPS             ON ( EPS.VisitID  = PRV.VisitID AND
                                                                               EPS.SourceID = PRV.SourceID )                                                                                                 
    LEFT JOIN ProblemList                                 DCP             ON ( DCP.PatientID = EPS.PatientID AND
                                                                               DCP.SourceID  = EPS.SourceID )


WHERE ( DCP.[TextLine]       LIKE '%Diabetes%'          OR 
        DCP.[TextLine]       LIKE '%Diabetic%'          OR
        DCP.[TextLine]       LIKE '%DM2%'               OR
        DCP.[TextLine]       LIKE '%DKA%'               OR
        DCP.[TextLine]       LIKE '%Hyperglycemia%'     OR
        DCP.[TextLine]       LIKE '%Hypoglycemia%'    ) AND
      ( LAB.[TestMnemonic] = 'GLU'                      OR
        LAB.[TestMnemonic] = '%HA1C'                  ) AND
        ADV.[Status]      != 'DIS CLI'                )

So this works okay, but when the doctor goes into the patient's Problem List and makes a change it refiles the whole list, which populates the ProblemList table again. So for 1 visit, I may get 4 duplicate entries thanks to the ProblemList and I only want one. It doesn't matter which one either.

I tried referencing other questions and nest another SELECT statement in but I just kept getting syntax errors.

This is what the duplicate values look like:

1111111111  SMITH,JOHN  1111    1/1/1901    CHEST PAIN  1111    2   111-111-1111    1/1/1901 12:15  DIS IN  DOEJO   GLU 120  H  1/2/1901 6:35   NULL    CHEST PAIN  Diabetes type 2, controlled
1111111111  SMITH,JOHN  1111    1/1/1901    CHEST PAIN  1111    2   111-111-1111    1/1/1901 12:15  DIS IN  DOEJO   GLU 120  H  1/2/1901 6:35   NULL    CHEST PAIN  Diabetes type 2, controlled
1111111111  SMITH,JOHN  1111    1/1/1901    CHEST PAIN  1111    2   111-111-1111    1/1/1901 12:15  DIS IN  DOEJO   GLU 120  H  1/2/1901 6:35   NULL    CHEST PAIN  Diabetes type 2, controlled
1111111111  SMITH,JOHN  1111    1/1/1901    CHEST PAIN  1111    2   111-111-1111    1/1/1901 12:15  DIS IN  DOEJO   GLU 120  H  1/2/1901 6:35   NULL    CHEST PAIN  Diabetes type 2, controlled

At the very end, the 'Diabetes type 2, controlled' is what's causing the duplicates. If I remove the ProblemListVisit and ProblemList tables from the query I get only one row of data.

What's most important is getting all of the unique Test results but NOT all the duplicate entries of the problem list (just want to know what type of diabetes they have, ONCE).

Thanks!

Upvotes: 1

Views: 165

Answers (3)

JNevill
JNevill

Reputation: 50034

In lieu of DISTINCT which I think would be the quickest way of achieving this, you could also move each of your tables that generate multiple lines into a subquery wherein you GROUP BY the values that you seek for your JOINS and SELECTS.

There are two advantages here:

  1. You gain more control over the output from these more granular tables and

  2. you reduce the overhead on the JOIN, which will cut your I/O and CPU usage, when you restrict what they allow through with the WHERE clause inside the subquery.

Code:

SELECT
         ADV.[VisitID]          AS  VisitID
        ,ADV.[Name]             AS  Name
        ,ADV.[UnitNumber]       AS  UnitNumber
        ,CONVERT(DATE,ADV.[BirthDateTime])                          AS  BirthDate
        ,ADV.[ReasonForVisit]   AS  ReasonForVisit
        ,ADV.[AccountNumber]    AS  AccountNumber
        ,DATEDIFF(day, ADV.ServiceDateTime, DIS.DischargeDateTime)  AS LOS
        ,ADV.[HomePhone]        AS  PhoneNumber
        ,ADV.[ServiceDateTime]  AS  ServiceDateTime
        ,ADV.[Status]           AS  'Status'
        ,PRV.[PrimaryCareID]    AS  PCP
        ,LAB.[TestMnemonic]     AS  Test
        ,LAB.[ResultRW]         AS  Result
        ,LAB.[AbnormalFlag]     AS  AbnormalFlag
        ,LAB.[ResultDateTime]   AS  ResultDateTime
        ,DIS.[Diagnosis]        AS  DischargeDiagnosis
        ,DIS.[ErDiagnosis]      AS  ERDiagnosis
        ,DCP.[TextLine]         AS  ProblemList


FROM          Visits                                      ADV
    LEFT JOIN Tests                                       LAB             ON ( LAB.VisitID  = ADV.VisitID AND
                                                                               LAB.SourceID = ADV.SourceID )
    LEFT JOIN Discharge                                   DIS             ON ( DIS.VisitID  = LAB.VisitID AND
                                                                               DIS.SourceID = LAB.SourceID )
    LEFT JOIN Providers                                   PRV             ON ( PRV.VisitID  = DIS.VisitID AND
                                                                               PRV.SourceID = DIS.SourceID )
    LEFT JOIN 
        (
            SELECT 
                VisitID, 
                SourceID, 
                PatientID
            FROM ProblemListVisits 
            GROUP BY 
                VisitID, 
                SourceID, 
                PatientID
        )                                                 EPS             ON ( EPS.VisitID  = PRV.VisitID AND
                                                                               EPS.SourceID = PRV.SourceID )                                                                                                 
    LEFT JOIN 
        (
            SELECT 
                PatientID, 
                SourceID, 
                TextLine 
            FROM ProblemList 
            WHERE 
                [TextLine]       LIKE '%Diabetes%'          OR 
                [TextLine]       LIKE '%Diabetic%'          OR
                [TextLine]       LIKE '%DM2%'               OR
                [TextLine]       LIKE '%DKA%'               OR
                [TextLine]       LIKE '%Hyperglycemia%'     OR
                [TextLine]       LIKE '%Hypoglycemia%' 
            GROUP BY 
                PatientID, 
                SourceID, 
                TextLine 
        )                                                  DCP             ON ( DCP.PatientID = EPS.PatientID AND
                                                                               DCP.SourceID  = EPS.SourceID )


WHERE ( LAB.[TestMnemonic] = 'GLU'                      OR
        LAB.[TestMnemonic] = '%HA1C'                  ) AND
        ADV.[Status]      != 'DIS CLI'                )

In the event that you are still getting multiples it suggests that [TextLine] has more than one value for each VisitID/PatientID combination in your ProblemList table. At that point you can remove that one from your GROUP BY clause and use some sort of aggregation on that field like MAX([TextLine]) in your subquery. I suspect, though, that you won't have duplicates after using DISTINCT or using this subquery method.

Upvotes: 1

David Hempy
David Hempy

Reputation: 6237

Try adding DISTINCT after SELECT. Something like this:

SELECT DISTINCT
     ADV.[VisitID]          AS  VisitID
    ,ADV.[Name]             AS  Name 
    ...

Upvotes: 0

Smog
Smog

Reputation: 604

the Distinct clause should do the trick but if not you can change

LEFT JOIN ProblemList   DCP             ON ( DCP.PatientID = EPS.PatientID AND
                                             DCP.SourceID  = EPS.SourceID )

for

OUTER APPLY (Select top 1 DCP.[TextLine]  FROM  ProblemList   DCP WHERE
                                             DCP.PatientID = EPS.PatientID  
                                           AND DCP.SourceID  = EPS.SourceID) DCP

Upvotes: 1

Related Questions