Reputation: 141
I run the following query to find the number of duplicates within my tables
SELECT
sum(duplicates)
FROM
(SELECT
COUNT (*) as duplicates
FROM
db.4585_20150727,
db.5691_20150727,
db.9884263_20150727,
db.9884623_20150727
WHERE
(A LIKE "1" OR A LIKE "2" OR A LIKE "3") AND
B NOT LIKE "XYZ" AND ID IS NOT null
GROUP EACH BY
ID
having
count (*) >1)
I want to see what type of units are duplicating and where. field_1 is unit type and field_2 is unit location. a single unit type can be in multiple locations and a location can have multiple unit types. Instead of just seeing the overall number of duplicates, I want to see the number of duplicates per unit type, unit location.
When I run the following, my number of results is inconsistent with the output of the first query. I am all but certain that the inconsistency is due to the group by. Is it possible that the group by is filtering out some rows? Shouldn't the sum of duplicates be identical to the first query? The result of the query below is smaller than the result of the initial sum of duplicates query above. The only difference is that I am including field_1 and field_2 in my selects and group bys.
SELECT
field_1,
field_2,
sum(duplicates)
FROM
(SELECT
field_1,
field_2,
COUNT (*) as duplicates
FROM
db.4585_20150727,
db.5691_20150727,
db.9884263_20150727,
db.9884623_20150727
WHERE
(A LIKE "1" OR A LIKE "2" OR A LIKE "3") AND
B NOT LIKE "XYZ" AND ID IS NOT null
GROUP EACH BY
field_1,
field_2,
ID
having
count (*) >1)
group each by
field_1,
field_2
Please let me know how I need to adjust this second query to be consistent with the result of my initial query.
Thanks!
Upvotes: 3
Views: 284
Reputation: 59225
My guess: You are running a "HAVING COUNT()>1". When the first query runs it filters some elements. When the second query runs, as it has more restrictive groups, more rows fall in the COUNT()=1 bucket, thus they get filtered.
SELECT SUM(duplicates)
FROM (
SELECT COUNT(*) as duplicates
FROM [publicdata:samples.shakespeare]
GROUP EACH BY word
HAVING COUNT(*)>1
)
149130
vs
SELECT SUM(duplicates)
FROM (
SELECT COUNT(*) as duplicates
FROM [publicdata:samples.shakespeare]
GROUP EACH BY word, word_count
HAVING COUNT(*)>1
)
130619
Upvotes: 3