Hive SQL - Refining JOIN query to ignore Null values

Question

I'm a little new with SQL so bear with me.

I have two tables, each with an ID column. Table A has a column titled role, Table B has a column titled outcome. I want to query these tables to find which rows based on the ID have role = 'PS' and outcome = 'DE'. Here is my code:

SELECT count(*)
FROM A JOIN B
ON (A.id = B.id
AND A.role = 'PS'
AND B.outcome = 'DE')

I've been searching the internet for a way to do this so that it doesn't include rows that have null values for either A.role or B.outcome.

The above code returns lets say 40,100, even though the total number of entries in B where B.outcome = 'DE' is only 40,000. So it is obviously including entries that do not fit my conditions. Is there a way to better refine my query?

Andomar · Accepted Answer

Your query already excludes rows with a null value in A.role. After all, null = 'PS' is not true, and you're using an inner join.

There's an easy explanation of how you can retrieve more rows from the join than there are in B. Say you have these rows for A:

 A.id   A.role
 1      'A'
 1      'A'

And these rows for B:

 B.id   B.outcome
 1      'A'
 1      'A'

Then this query:

select  *
from    A
join    B
on      A.id = B.id and A.role = 'A' and B.role = 'A'

will return 4 rows. That's more than there are in table A or B!

So I'd investigate whether id is unique:

select count(*) from A group by id having count(*) > 1
select count(*) from B group by id having count(*) > 1

If these queries return a count greater than zero, id is not unique. Since a join repeats rows for each match, that would explain a large increase in the amount of returned records.

Hive SQL - Refining JOIN query to ignore Null values

Answers (1)

Related Questions