sum of results inconsistent when grouping by

Question

I run the following query to find the number of duplicates within my tables

SELECT
sum(duplicates)
FROM
  (SELECT
  COUNT (*) as duplicates
  FROM
  db.4585_20150727,
  db.5691_20150727,
  db.9884263_20150727,
  db.9884623_20150727 
  WHERE
  (A LIKE "1" OR A LIKE "2" OR A LIKE "3") AND
  B NOT LIKE "XYZ" AND ID IS NOT null 
GROUP EACH BY
ID
having
count (*) >1)

I want to see what type of units are duplicating and where. field_1 is unit type and field_2 is unit location. a single unit type can be in multiple locations and a location can have multiple unit types. Instead of just seeing the overall number of duplicates, I want to see the number of duplicates per unit type, unit location.

When I run the following, my number of results is inconsistent with the output of the first query. I am all but certain that the inconsistency is due to the group by. Is it possible that the group by is filtering out some rows? Shouldn't the sum of duplicates be identical to the first query? The result of the query below is smaller than the result of the initial sum of duplicates query above. The only difference is that I am including field_1 and field_2 in my selects and group bys.

SELECT
field_1,
field_2,
sum(duplicates)
FROM
  (SELECT
  field_1,
  field_2,
  COUNT (*) as duplicates
  FROM
  db.4585_20150727,
  db.5691_20150727,
  db.9884263_20150727,
  db.9884623_20150727 
  WHERE
  (A LIKE "1" OR A LIKE "2" OR A LIKE "3") AND
  B NOT LIKE "XYZ" AND ID IS NOT null 
GROUP EACH BY
field_1,
field_2,
ID
having
count (*) >1)
group each by
field_1,
field_2

Please let me know how I need to adjust this second query to be consistent with the result of my initial query.

Thanks!

Felipe Hoffa · Accepted Answer

My guess: You are running a "HAVING COUNT()>1". When the first query runs it filters some elements. When the second query runs, as it has more restrictive groups, more rows fall in the COUNT()=1 bucket, thus they get filtered.

SELECT SUM(duplicates)
FROM (
  SELECT COUNT(*) as duplicates
  FROM [publicdata:samples.shakespeare] 
  GROUP EACH BY word
  HAVING COUNT(*)>1
)

149130

vs

SELECT SUM(duplicates)
FROM (
  SELECT COUNT(*) as duplicates
  FROM [publicdata:samples.shakespeare] 
  GROUP EACH BY word, word_count
  HAVING COUNT(*)>1
)

130619

sum of results inconsistent when grouping by

Answers (1)

Related Questions