Reputation: 700

Ignore rows with NULL join columns in hive query

I have three tables A, B and C. A is having 1 billion records, B is having 10 million records and C is having 5 million records. My query is like

select * from tableA a left outer join tableB b on a.id=b.id left outer join tableC c on b.id=c.id;

After first join i will be having more than 990 million NULL b.id columns. Now the second join on table C will require all 990 million NULL rows (b.Id) to be processed and this causes one reducer to be loaded for a very long time. Is there a way i can avoid rows with NULL join columns?

Upvotes: 2

Answers (2)

Despicable me

Reputation: 700

We have used rand() for NULL ; so our join condition will be

coalesce(b.id, rand()) = c.id

Thus null values got distributed by its own, but i am wondering why the skewjoin settings didnot help (we have tried coalesce(b.id, 'SomeString') = c.id with skewjoin enable )

Upvotes: 2

leftjoin

Reputation: 38335

Add b.id is not null condition to the ON clause. Depending on your Hive version this may help:

select * 
   from tableA a 
       left outer join tableB b on a.id=b.id 
       left outer join tableC c on b.id=c.id and b.id is not null;

But this is not a problem since 0.14 version as far as I know.

Also you can divide null rows and not null and join only not null rows. In the first query only null rows selected. Add NULL as col for columns from C table. Then use UNION ALL + select all not null rows:

with a as(
select a.*, b.* 
   from tableA a 
       left outer join tableB b on a.id=b.id
)

select a.*, null as c_col1 --add all other columns(from c) as null to get same schema
   from a where a.b_id_col is null
UNION ALL
select a.*, c.*
   left outer join tableC c on a.b_id_col=c.id
   from a where a.b_id_col is not null

Upvotes: 1

Ignore rows with NULL join columns in hive query

Answers (2)

Related Questions