Reputation: 9815
I have a table which has records that contain a persons information and a filename that the information originated from, so the table looks like so:
|Table|
|id, first-name, last-name, ssn, filename|
I also have a stored procedure that provides some analytics for the files in the system and i'm trying to add information to that stored procedure to shed light into the possibility of duplicates.
Here is the current stored procedure
SELECT [filename],
COUNT([filename]) as totalRecords,
COUNT(closedleads.id) as closedRecords,
ROUND(--calcs percent of records closed in a file)
FROM table
LEFT OUTER JOIN closedleads ON closedleads.leadid = table.id
GROUP BY [filename]
What I want to add is the ability to see maybe # of possible duplicates, defined as records with matching SSNs and I am at a loss as to how I could perform a count on a sub query or join and include it in the results set. Can anyone provide some pointers?
What I'm trying to do is add something like this to my procedure above
SELECT COUNT(
SELECT COUNT(*) FROM Table T1
INNER JOIN Table T2 on T1.SSN = T2.SSN
WHERE T1.id != T2.id
) as PossibleDuplicates
What I'm looking for is merging this code with my procedure above so I can get all of the same data in one and possible have this # of duplicates across each filename, so for each filename I get a result of # of records, # of records closed and # of possible duplicates
EDIT:
I'm very close to my desired goal but I'm failing on the last little bit--getting the number of possible duplicates BY filename, here is my query
select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
SELECT [filename], count([filename]) as leads,
count(closedleads.id) as closed
FROM Table
left join closedleads on closedleads.leadid = Table.id
group by [filename]
) as [q1]
INNER JOIN (
select count([ssn]) as dups, [filename] from Table
group by [ssn], [filename]
having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]
This works but it showing multiple results for each filename with values of 2-5 instead of summing the total count of possible duplicates
Hey everyone, thanks for all the help, this is eventually what I got to that worked exactly as I wanted
select [q1].[filename], [q1].leads, [q1].closed, [q2].dups,
round(([q1].closed / [q1].leads), 3) as percentClosed
FROM (
SELECT [filename], count([filename]) as leads,
count(closedleads.id) as closed
FROM Table
left join closedleads on closedleads.leadid = Table.id
and [filename] is not null
group by [filename]
) as [q1]
INNER JOIN (
select [filename], count(*) - count(distinct [ssn]) as dups
from Table
group by [filename]
) as [q2] on [q1].[filename] = [q2].[filename]
Upvotes: 2
Views: 1966
Reputation: 3893
I think the existing answers don't quite understand your question. I think I do but it's not completely specified yet. Is it a duplicate if the same SSN appears in two different files or only within the same file? Because you group by filename, that becomes the grain.
The Output of your query is like
StateFarm1, 500, 50, 10%, <your new value goes here>
AllState2, 100, 90, 90% <your new value goes here>
So if you have the same SSN in those two files, you have 1 duplicate, so on which row do you show 1, on the AllState row or the Statefarm row? If you say both, invariably someone will SUM that column and get a doubling of the results.
Now What if you have a Geico row with the same SSN, is that 1 duplicate or 2? and again which row?
I know this isn't a final answer but these questions do highlight the the question as it stands is unanswerable... you fix this and I'll change the answer,
please no downvotes in the meantime
I believe the only thing you are missing is a DISTINCT.
select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
SELECT [filename], count([filename]) as leads,
count(closedleads.id) as closed
FROM tbldata
left join closedleads on closedleads.leadid = Table.id
group by [filename]
) as [q1]
INNER JOIN (
select count( DISTINCT [ssn]) as dups, [filename] from Table '<---- here'
group by [ssn], [filename]
having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]
Upvotes: 0
Reputation: 67068
Edit - Updated with a better example which matches your question better
Here's an example if I understand correctly.
create table #table (id int,ssn varchar(10))
insert into #table values(1,'10')
insert into #table values(2,'10')
insert into #table values(3,'11')
insert into #table values(4,'12')
insert into #table values(5,'11')
insert into #table values(6,'13')
select sum(cnt)
from (
select count(distinct ssn) as cnt
from #table
group by ssn
having count(*)>1
) dups
You shouldn't need to self join the table if you group by ssn and then pull back only ssn's where you have more then one.
Upvotes: 1
Reputation: 17388
You'll probably want to make use of a HAVING clause somewhere, eg:
LEFT JOIN (
SELECT SSN, COUNT(SSN) - 1 DupeCount FROM Table T1
GROUP BY SSN
HAVING COUNT(SSN) > 1 ) AS PossibleDuplicates
ON table.ssn = PossibleDuplicates.SSN
If you want to include 0 possible duplicates (rather than null) you actually don't need the HAVING
clause, just the left join.
Upvotes: 3
Reputation: 65157
You don't need the outer COUNT
- your inner SELECT COUNT(*)...
will return you just one number, a count of records with duplicate SSN
but different id
.
Upvotes: 0