Reputation: 207
I'm pretty much a noob when it comes to SQL so any help would be appreciated. I have a large data set that I am filtering through for a hospital. I am pulling data from 6 different tables and one of my tables has duplicate rows for each visit. I only want to pull in one row for each visit (it doesn't matter which row is pulled in). I know I need to use a DISTINCT, or GROUP BY clause but my syntax must be wrong.
SELECT
ADV.[VisitID] AS VisitID
,ADV.[Name] AS Name
,ADV.[UnitNumber] AS UnitNumber
,CONVERT(DATE,ADV.[BirthDateTime]) AS BirthDate
,ADV.[ReasonForVisit] AS ReasonForVisit
,ADV.[AccountNumber] AS AccountNumber
,DATEDIFF(day, ADV.ServiceDateTime, DIS.DischargeDateTime) AS LOS
,ADV.[HomePhone] AS PhoneNumber
,ADV.[ServiceDateTime] AS ServiceDateTime
,ADV.[Status] AS 'Status'
,PRV.[PrimaryCareID] AS PCP
,LAB.[TestMnemonic] AS Test
,LAB.[ResultRW] AS Result
,LAB.[AbnormalFlag] AS AbnormalFlag
,LAB.[ResultDateTime] AS ResultDateTime
,DIS.[Diagnosis] AS DischargeDiagnosis
,DIS.[ErDiagnosis] AS ERDiagnosis
,DCP.[TextLine] AS ProblemList
FROM Visits ADV
LEFT JOIN Tests LAB ON ( LAB.VisitID = ADV.VisitID AND
LAB.SourceID = ADV.SourceID )
LEFT JOIN Discharge DIS ON ( DIS.VisitID = LAB.VisitID AND
DIS.SourceID = LAB.SourceID )
LEFT JOIN Providers PRV ON ( PRV.VisitID = DIS.VisitID AND
PRV.SourceID = DIS.SourceID )
LEFT JOIN ProblemListVisits EPS ON ( EPS.VisitID = PRV.VisitID AND
EPS.SourceID = PRV.SourceID )
LEFT JOIN ProblemList DCP ON ( DCP.PatientID = EPS.PatientID AND
DCP.SourceID = EPS.SourceID )
WHERE ( DCP.[TextLine] LIKE '%Diabetes%' OR
DCP.[TextLine] LIKE '%Diabetic%' OR
DCP.[TextLine] LIKE '%DM2%' OR
DCP.[TextLine] LIKE '%DKA%' OR
DCP.[TextLine] LIKE '%Hyperglycemia%' OR
DCP.[TextLine] LIKE '%Hypoglycemia%' ) AND
( LAB.[TestMnemonic] = 'GLU' OR
LAB.[TestMnemonic] = '%HA1C' ) AND
ADV.[Status] != 'DIS CLI' )
So this works okay, but when the doctor goes into the patient's Problem List and makes a change it refiles the whole list, which populates the ProblemList table again. So for 1 visit, I may get 4 duplicate entries thanks to the ProblemList and I only want one. It doesn't matter which one either.
I tried referencing other questions and nest another SELECT statement in but I just kept getting syntax errors.
This is what the duplicate values look like:
1111111111 SMITH,JOHN 1111 1/1/1901 CHEST PAIN 1111 2 111-111-1111 1/1/1901 12:15 DIS IN DOEJO GLU 120 H 1/2/1901 6:35 NULL CHEST PAIN Diabetes type 2, controlled
1111111111 SMITH,JOHN 1111 1/1/1901 CHEST PAIN 1111 2 111-111-1111 1/1/1901 12:15 DIS IN DOEJO GLU 120 H 1/2/1901 6:35 NULL CHEST PAIN Diabetes type 2, controlled
1111111111 SMITH,JOHN 1111 1/1/1901 CHEST PAIN 1111 2 111-111-1111 1/1/1901 12:15 DIS IN DOEJO GLU 120 H 1/2/1901 6:35 NULL CHEST PAIN Diabetes type 2, controlled
1111111111 SMITH,JOHN 1111 1/1/1901 CHEST PAIN 1111 2 111-111-1111 1/1/1901 12:15 DIS IN DOEJO GLU 120 H 1/2/1901 6:35 NULL CHEST PAIN Diabetes type 2, controlled
At the very end, the 'Diabetes type 2, controlled' is what's causing the duplicates. If I remove the ProblemListVisit and ProblemList tables from the query I get only one row of data.
What's most important is getting all of the unique Test results but NOT all the duplicate entries of the problem list (just want to know what type of diabetes they have, ONCE).
Thanks!
Upvotes: 1
Views: 165
Reputation: 50034
In lieu of DISTINCT
which I think would be the quickest way of achieving this, you could also move each of your tables that generate multiple lines into a subquery wherein you GROUP BY the values that you seek for your JOINS and SELECTS.
There are two advantages here:
You gain more control over the output from these more granular tables and
you reduce the overhead on the JOIN, which will cut your I/O and CPU usage, when you restrict what they allow through with the WHERE clause inside the subquery.
Code:
SELECT
ADV.[VisitID] AS VisitID
,ADV.[Name] AS Name
,ADV.[UnitNumber] AS UnitNumber
,CONVERT(DATE,ADV.[BirthDateTime]) AS BirthDate
,ADV.[ReasonForVisit] AS ReasonForVisit
,ADV.[AccountNumber] AS AccountNumber
,DATEDIFF(day, ADV.ServiceDateTime, DIS.DischargeDateTime) AS LOS
,ADV.[HomePhone] AS PhoneNumber
,ADV.[ServiceDateTime] AS ServiceDateTime
,ADV.[Status] AS 'Status'
,PRV.[PrimaryCareID] AS PCP
,LAB.[TestMnemonic] AS Test
,LAB.[ResultRW] AS Result
,LAB.[AbnormalFlag] AS AbnormalFlag
,LAB.[ResultDateTime] AS ResultDateTime
,DIS.[Diagnosis] AS DischargeDiagnosis
,DIS.[ErDiagnosis] AS ERDiagnosis
,DCP.[TextLine] AS ProblemList
FROM Visits ADV
LEFT JOIN Tests LAB ON ( LAB.VisitID = ADV.VisitID AND
LAB.SourceID = ADV.SourceID )
LEFT JOIN Discharge DIS ON ( DIS.VisitID = LAB.VisitID AND
DIS.SourceID = LAB.SourceID )
LEFT JOIN Providers PRV ON ( PRV.VisitID = DIS.VisitID AND
PRV.SourceID = DIS.SourceID )
LEFT JOIN
(
SELECT
VisitID,
SourceID,
PatientID
FROM ProblemListVisits
GROUP BY
VisitID,
SourceID,
PatientID
) EPS ON ( EPS.VisitID = PRV.VisitID AND
EPS.SourceID = PRV.SourceID )
LEFT JOIN
(
SELECT
PatientID,
SourceID,
TextLine
FROM ProblemList
WHERE
[TextLine] LIKE '%Diabetes%' OR
[TextLine] LIKE '%Diabetic%' OR
[TextLine] LIKE '%DM2%' OR
[TextLine] LIKE '%DKA%' OR
[TextLine] LIKE '%Hyperglycemia%' OR
[TextLine] LIKE '%Hypoglycemia%'
GROUP BY
PatientID,
SourceID,
TextLine
) DCP ON ( DCP.PatientID = EPS.PatientID AND
DCP.SourceID = EPS.SourceID )
WHERE ( LAB.[TestMnemonic] = 'GLU' OR
LAB.[TestMnemonic] = '%HA1C' ) AND
ADV.[Status] != 'DIS CLI' )
In the event that you are still getting multiples it suggests that [TextLine] has more than one value for each VisitID/PatientID combination in your ProblemList table. At that point you can remove that one from your GROUP BY clause and use some sort of aggregation on that field like MAX([TextLine])
in your subquery. I suspect, though, that you won't have duplicates after using DISTINCT
or using this subquery method.
Upvotes: 1
Reputation: 6237
Try adding DISTINCT
after SELECT
. Something like this:
SELECT DISTINCT
ADV.[VisitID] AS VisitID
,ADV.[Name] AS Name
...
Upvotes: 0
Reputation: 604
the Distinct
clause should do the trick
but if not you can change
LEFT JOIN ProblemList DCP ON ( DCP.PatientID = EPS.PatientID AND
DCP.SourceID = EPS.SourceID )
for
OUTER APPLY (Select top 1 DCP.[TextLine] FROM ProblemList DCP WHERE
DCP.PatientID = EPS.PatientID
AND DCP.SourceID = EPS.SourceID) DCP
Upvotes: 1