Reputation: 1190
I'm doing a query in BigQuery:
SELECT id FROM [table] WHERE city = 'New York City' GROUP BY id
The weird part is it shows duplicate ids, often right next to each other. There is absolutely nothing different between the ids themselves. There are around 3 million rows total, for ~500k IDs. So there are a lot of duplicates, but that is by design. We figured the filtering would easily eliminate that but noticed discrepancies in totals.
Is there a reason BigQuery's GROUP BY
function would work improperly? For what its worth, the dataset has ~3 million rows.
Example of duplicate ID:
56abdb5b9a75d90003001df6
56abdb5b9a75d90003001df6
Upvotes: 1
Views: 556
Reputation: 172944
the only explanation is your id is STRING and in reality those two ids are different because of spaces before or most likely after what is "visible" for eyes
I recommend you to adjust your query like below
SELECT REPLACE(id, ' ', '')
FROM [table]
WHERE city = 'New York City'
GROUP BY 1
another option to troubleshoot would be below
SELECT id, LENGTH(id)
FROM [table]
WHERE city = 'New York City'
GROUP BY 1, 2
so you can see if those ids are same by length or not - my initial assumption was about space - but it can be any other char(s) including non printable
Upvotes: 4