Reputation: 37
I am still new to working in databases, so please have patience with me. I have read through a number of similar questions, but none of them seem to be talking about the same issue I am facing.
Just a bit of info on what I am doing, I have a table filled with contact information, and some of the contacts are duplicated, but most of the duplicated rows have a truncated phone number, which makes that data useless.
I wrote the following query to search for the duplicates:
WITH CTE (CID, Firstname, lastname, phone, email, length, dupcnt) AS
(
SELECT
CID, Firstname, lastname, phone, email, LEN(phone) AS length,
ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
ORDER BY Firstname) AS dupcnt
FROM
[data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1
AND length <= 10
I assumed that this query would find all records that have duplicates based on the three columns that I have specified, and select any that have the dupcnt
greater than 1, and a phone column with a length less than or equal to 10. But when I run the query more than once I get different result sets each execution. There must be some logic that I am missing here, but I am completely baffled by this. All of the columns are of varchar
datatype, except for CID, which is int
.
Upvotes: 1
Views: 4655
Reputation: 40369
I believe you are getting different results with every run would be because (a) unless clearly specified in the query, you can assume nothing about the order in which SQL return data in a query, and (b) the only ordering criteria you provide is by FirstName, which is far less precise than your grouping (Firstname, lastname, email).
As for the query itself, as written it assumes that the first item found in a given partition contains a valid phone number. Without specifying the order, you cannot know this will be true… and what if all items in a given grouping have invalid phone numbers? Below is my stab at pulling out the data you're looking for, in a hopefully useful format.
WITH CTE -- Sorry, I'm lazy and generally don't list the columns
AS
(
SELECT
Firstname
,lastname
,phone
,count(*) HowMany -- How many in group
,sum(case len(phone) when 10 then 1 else 0 end) BadLength -- How many "bad" in group
from data.com_raw
group by
Firstname
,lastname
,phone
having count(*) <> sum(case len(phone) when 10 then 1 else 0 end)
and count(*) > 1 -- Remove this to find singletons with invalid phone numbers
)
select
cr.CID
,cr.Firstname
,cr.lastname
,case len(cr.phone) when 10 then '' else 'Bad' end) IsBad
,cr.phone
,cr.email
from data.com_raw cr
inner join CTE
on CTE.Firstname = cr.Firstname
and CTE.lastname = cr.lastname
and CTE.phone = cr.phone
order by
cr.CID
,cr.Firstname
,cr.lastname
,case len(cr.phone) when 10 then '' else 'Bad' end)
,cr.phone
(Yes, if there are no indexes to support this, you will end up with a table scan.)
Upvotes: 0
Reputation: 871
SELECT Firstname, lastname,email, COUNT(*)
FROM [data.com_raw]
GROUP BY Firstname, lastname,email HAVING COUNT(*)>1
WHERE LEN(PHONE)<= 10
Upvotes: -1
Reputation: 45106
ORDER BY Firstname
is non-deterministic here as they all have the same Firstname from the partition by
If CID is unique you could use that for the order by but I suspect you really want count.
Upvotes: 0
Reputation: 50308
Instead of ROW_NUMBER()
use COUNT(*)
, and remove the ORDER BY since that's not necessary with COUNT(*)
.
The way you have it now, you are chunking up records into similar groups/partitions of records by firstname
/lastname
/email
. Then you are ORDERING each group/partition by firstname
. Firstname
is part of the partition, meaning every firstname in that group/partition is identical. You will get different results depending on how SQL Server fetches the results from storage (which record it found first is 1
, what it found second is 2
). Every time it fetches records (every time you run this sql) it may fetch each record from disk or cache at a different order.
Count(*)
will return ALL duplicate rows
So instead:
COUNT(*) OVER (PARTITION BY Firstname, lastname, email ) AS dupcnt
Which will return the number of records that share the same firstname, lastname, and email. You then keep any record that is greater than 1.
Upvotes: 3