Reputation: 9815

Help with SQL aggregate query with detecting duplicates

I have a table which has records that contain a persons information and a filename that the information originated from, so the table looks like so:

|Table|
|id, first-name, last-name, ssn, filename|

I also have a stored procedure that provides some analytics for the files in the system and i'm trying to add information to that stored procedure to shed light into the possibility of duplicates.

Here is the current stored procedure

SELECT [filename],
       COUNT([filename]) as totalRecords,
       COUNT(closedleads.id) as closedRecords,
       ROUND(--calcs percent of records closed in a file)
FROM table
LEFT OUTER JOIN closedleads ON closedleads.leadid = table.id
GROUP BY [filename]

What I want to add is the ability to see maybe # of possible duplicates, defined as records with matching SSNs and I am at a loss as to how I could perform a count on a sub query or join and include it in the results set. Can anyone provide some pointers?

What I'm trying to do is add something like this to my procedure above

SELECT COUNT(
    SELECT COUNT(*) FROM Table T1
    INNER JOIN Table T2 on T1.SSN = T2.SSN
    WHERE T1.id != T2.id
) as PossibleDuplicates

What I'm looking for is merging this code with my procedure above so I can get all of the same data in one and possible have this # of duplicates across each filename, so for each filename I get a result of # of records, # of records closed and # of possible duplicates

EDIT:

I'm very close to my desired goal but I'm failing on the last little bit--getting the number of possible duplicates BY filename, here is my query

select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
    SELECT [filename], count([filename]) as leads,
    count(closedleads.id) as closed
    FROM Table
    left join closedleads on closedleads.leadid = Table.id
    group by [filename]
) as [q1]
INNER JOIN (
    select count([ssn]) as dups, [filename] from Table
    group by [ssn], [filename]
    having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]

This works but it showing multiple results for each filename with values of 2-5 instead of summing the total count of possible duplicates

Working Query

Hey everyone, thanks for all the help, this is eventually what I got to that worked exactly as I wanted

select [q1].[filename], [q1].leads, [q1].closed, [q2].dups,
        round(([q1].closed / [q1].leads), 3) as percentClosed
FROM (
    SELECT [filename], count([filename]) as leads,
    count(closedleads.id) as closed
    FROM Table
    left join closedleads on closedleads.leadid = Table.id
    and [filename] is not null
    group by [filename]
) as [q1]
INNER JOIN (
    select [filename], count(*) - count(distinct [ssn]) as dups 
            from Table
            group by [filename]
) as [q2] on [q1].[filename] = [q2].[filename]

Upvotes: 2

Answers (4)

Stephanie Page

Reputation: 3893

I think the existing answers don't quite understand your question. I think I do but it's not completely specified yet. Is it a duplicate if the same SSN appears in two different files or only within the same file? Because you group by filename, that becomes the grain.

The Output of your query is like

StateFarm1, 500,   50,    10%,   <your new value goes here>
AllState2,  100,   90,    90%    <your new value goes here>

So if you have the same SSN in those two files, you have 1 duplicate, so on which row do you show 1, on the AllState row or the Statefarm row? If you say both, invariably someone will SUM that column and get a doubling of the results.

Now What if you have a Geico row with the same SSN, is that 1 duplicate or 2? and again which row?

I know this isn't a final answer but these questions do highlight the the question as it stands is unanswerable... you fix this and I'll change the answer,

please no downvotes in the meantime

Addendum

I believe the only thing you are missing is a DISTINCT.

select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
    SELECT [filename], count([filename]) as leads,
    count(closedleads.id) as closed
    FROM tbldata
    left join closedleads on closedleads.leadid = Table.id
    group by [filename]
) as [q1]
INNER JOIN (
    select count( DISTINCT [ssn]) as dups, [filename] from Table '<---- here'
    group by [ssn], [filename]
    having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]

Upvotes: 0

JoshBerke

Reputation: 67068

Edit - Updated with a better example which matches your question better

Here's an example if I understand correctly.

create table #table  (id int,ssn varchar(10))

insert into #table values(1,'10')
insert into #table values(2,'10')

insert into #table values(3,'11')
insert into #table values(4,'12')


insert into #table values(5,'11')

insert into #table values(6,'13')


select sum(cnt)
from (
select count(distinct ssn) as cnt
from #table
group by ssn 
having count(*)>1
) dups

You shouldn't need to self join the table if you group by ssn and then pull back only ssn's where you have more then one.

Upvotes: 1

Kevin Stricker

Reputation: 17388

You'll probably want to make use of a HAVING clause somewhere, eg:

    LEFT JOIN (
        SELECT SSN, COUNT(SSN) - 1 DupeCount FROM Table T1
        GROUP BY SSN
        HAVING COUNT(SSN) > 1 ) AS PossibleDuplicates
    ON table.ssn = PossibleDuplicates.SSN

If you want to include 0 possible duplicates (rather than null) you actually don't need the HAVING clause, just the left join.

Upvotes: 3

JNK

Reputation: 65157

You don't need the outer COUNT - your inner SELECT COUNT(*)... will return you just one number, a count of records with duplicate SSN but different id.

Upvotes: 0

Help with SQL aggregate query with detecting duplicates

Working Query

Answers (4)

Addendum

Related Questions