Ashley O
Ashley O

Reputation: 1190

BigQuery GROUP BY function still showing duplicates

I'm doing a query in BigQuery:

SELECT id FROM [table] WHERE  city = 'New York City' GROUP BY id

The weird part is it shows duplicate ids, often right next to each other. There is absolutely nothing different between the ids themselves. There are around 3 million rows total, for ~500k IDs. So there are a lot of duplicates, but that is by design. We figured the filtering would easily eliminate that but noticed discrepancies in totals.

Is there a reason BigQuery's GROUP BY function would work improperly? For what its worth, the dataset has ~3 million rows.

Example of duplicate ID:

56abdb5b9a75d90003001df6
56abdb5b9a75d90003001df6

Upvotes: 1

Views: 556

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

the only explanation is your id is STRING and in reality those two ids are different because of spaces before or most likely after what is "visible" for eyes

I recommend you to adjust your query like below

SELECT REPLACE(id, ' ', '') 
FROM [table] 
WHERE city = 'New York City' 
GROUP BY 1  

another option to troubleshoot would be below

SELECT id, LENGTH(id)
FROM [table] 
WHERE city = 'New York City' 
GROUP BY 1, 2    

so you can see if those ids are same by length or not - my initial assumption was about space - but it can be any other char(s) including non printable

Upvotes: 4

Related Questions