ericbrownaustin
ericbrownaustin

Reputation: 1330

BigQuery Full outer join producing "left join" results

I have 2 tables, both of which contain distinct id values. Some of the id values might occur in both tables and some are unique to each table. Table1 has 10,910 rows and Table2 has 11,304 rows

When running a left join query:

SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id

I get a total of 10,896 rows or 10,896 ids shared across both tables.

However, when I run a FULL OUTER JOIN on the 2 tables like this:

SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id

I get total of 10,896 rows, but I was expecting all 10,910 rows from table1.

I am wondering if there is an issue with my query syntax.

Upvotes: 6

Views: 37512

Answers (4)

user20565065
user20565065

Reputation: 1

You will have to add coalesce... BigQuery, unlike traditional SQL does not recognize fields unless used explicitly

SELECT COUNT(DISTINCT coalesce(a.id,b.id)) FROM table1 a FULL OUTER JOIN EACH table2 b on a.id = b.id

This query will now take full effect of full outer join :)

Upvotes: 0

Johanna
Johanna

Reputation: 11

It is because you count in both case the number of non-null lines for table a by using a count(distinct a.id). Use a count(*) and it should works.

Upvotes: 1

ericbrownaustin
ericbrownaustin

Reputation: 1330

I added the original query as a subquery and counted ids and produced the expected results. Still a little strange, but it works.

SELECT EXACT_COUNT_DISTINCT(a.id)
FROM
(SELECT a.id AS a.id,
b.id AS b.id
FROM table1 a FULL OUTER JOIN EACH table2 b on a.id = b.id))

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173210

As you are using EACH - it looks like you are running your queries in Legacy SQL mode.
In BigQuery Legacy SQL - COUNT(DISTINCT) function is probabilistic - gives statistical approximation and is not guaranteed to be exact.
You can use EXACT_COUNT_DISTINCT() function instead - this one gives you exact number but a little more expensive on back-end

Even better option - just use Standard SQL
For your specific query you will only need to remove EACH keyword and it should work as a charm

#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id

and

#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN table2 b on a.id = b.id

Upvotes: 10

Related Questions