Reputation: 37

Same query giving different results

I am still new to working in databases, so please have patience with me. I have read through a number of similar questions, but none of them seem to be talking about the same issue I am facing.

Just a bit of info on what I am doing, I have a table filled with contact information, and some of the contacts are duplicated, but most of the duplicated rows have a truncated phone number, which makes that data useless.

I wrote the following query to search for the duplicates:

WITH CTE (CID, Firstname, lastname, phone, email, length, dupcnt) AS
(
   SELECT 
       CID, Firstname, lastname, phone, email, LEN(phone) AS length,
       ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email 
                          ORDER BY Firstname) AS dupcnt
   FROM 
       [data.com_raw]
)
SELECT * 
FROM CTE
WHERE dupcnt > 1
  AND length <= 10

I assumed that this query would find all records that have duplicates based on the three columns that I have specified, and select any that have the dupcnt greater than 1, and a phone column with a length less than or equal to 10. But when I run the query more than once I get different result sets each execution. There must be some logic that I am missing here, but I am completely baffled by this. All of the columns are of varchar datatype, except for CID, which is int.

Upvotes: 1

Answers (4)

Philip Kelley

Reputation: 40369

I believe you are getting different results with every run would be because (a) unless clearly specified in the query, you can assume nothing about the order in which SQL return data in a query, and (b) the only ordering criteria you provide is by FirstName, which is far less precise than your grouping (Firstname, lastname, email).

As for the query itself, as written it assumes that the first item found in a given partition contains a valid phone number. Without specifying the order, you cannot know this will be true… and what if all items in a given grouping have invalid phone numbers? Below is my stab at pulling out the data you're looking for, in a hopefully useful format.

WITH CTE --  Sorry, I'm lazy and generally don't list the columns
AS
 (
   SELECT
      Firstname
     ,lastname
     ,phone
     ,count(*)  HowMany  --  How many in group
     ,sum(case len(phone) when 10 then 1 else 0 end)  BadLength  --  How many "bad" in group
    from data.com_raw
    group by
      Firstname
     ,lastname
     ,phone
    having count(*) <> sum(case len(phone) when 10 then 1 else 0 end)
     and count(*) > 1  --  Remove this to find singletons with invalid phone numbers
 )
select
   cr.CID
  ,cr.Firstname
  ,cr.lastname
  ,case len(cr.phone) when 10 then '' else 'Bad' end)  IsBad
  ,cr.phone
  ,cr.email
 from data.com_raw cr
  inner join CTE
   on CTE.Firstname = cr.Firstname
    and CTE.lastname = cr.lastname
    and CTE.phone = cr.phone
 order by
   cr.CID
  ,cr.Firstname
  ,cr.lastname
  ,case len(cr.phone) when 10 then '' else 'Bad' end)
  ,cr.phone

(Yes, if there are no indexes to support this, you will end up with a table scan.)

Upvotes: 0

Jay Desai

Reputation: 871

SELECT Firstname, lastname,email, COUNT(*)
FROM [data.com_raw]
GROUP BY Firstname, lastname,email HAVING COUNT(*)>1
WHERE LEN(PHONE)<= 10

Upvotes: -1

paparazzo

Reputation: 45106

ORDER BY Firstname is non-deterministic here as they all have the same Firstname from the partition by

If CID is unique you could use that for the order by but I suspect you really want count.

Upvotes: 0

JNevill

Reputation: 50308

Instead of ROW_NUMBER() use COUNT(*), and remove the ORDER BY since that's not necessary with COUNT(*).

The way you have it now, you are chunking up records into similar groups/partitions of records by firstname/lastname/email. Then you are ORDERING each group/partition by firstname. Firstname is part of the partition, meaning every firstname in that group/partition is identical. You will get different results depending on how SQL Server fetches the results from storage (which record it found first is 1, what it found second is 2). Every time it fetches records (every time you run this sql) it may fetch each record from disk or cache at a different order.

Count(*) will return ALL duplicate rows

So instead:

 COUNT(*) OVER (PARTITION BY Firstname, lastname, email ) AS dupcnt

Which will return the number of records that share the same firstname, lastname, and email. You then keep any record that is greater than 1.

Upvotes: 3

Same query giving different results

Answers (4)

Related Questions