Reputation: 700
I have three tables A, B and C. A is having 1 billion records, B is having 10 million records and C is having 5 million records. My query is like
select *
from tableA a
left outer join tableB b on a.id=b.id
left outer join tableC c on b.id=c.id;
After first join i will be having more than 990 million NULL b.id columns. Now the second join on table C will require all 990 million NULL rows (b.Id) to be processed and this causes one reducer to be loaded for a very long time. Is there a way i can avoid rows with NULL join columns?
Upvotes: 2
Views: 7420
Reputation: 700
We have used rand() for NULL ; so our join condition will be
coalesce(b.id, rand()) = c.id
Thus null values got distributed by its own, but i am wondering why the skewjoin settings didnot help (we have tried coalesce(b.id, 'SomeString') = c.id with skewjoin enable )
Upvotes: 2
Reputation: 38335
Add b.id is not null
condition to the ON clause. Depending on your Hive version this may help:
select *
from tableA a
left outer join tableB b on a.id=b.id
left outer join tableC c on b.id=c.id and b.id is not null;
But this is not a problem since 0.14 version as far as I know.
Also you can divide null rows and not null and join only not null rows. In the first query only null rows selected. Add NULL as col for columns from C table. Then use UNION ALL + select all not null rows:
with a as(
select a.*, b.*
from tableA a
left outer join tableB b on a.id=b.id
)
select a.*, null as c_col1 --add all other columns(from c) as null to get same schema
from a where a.b_id_col is null
UNION ALL
select a.*, c.*
left outer join tableC c on a.b_id_col=c.id
from a where a.b_id_col is not null
Upvotes: 1