zephyrus
zephyrus

Reputation: 261

How to replace IN clause with JOIN in Oracle SQL?

I'm trying to rewrite the below query to replace the 'IN' clause with inner join

select * from employee_rec er 
  inner join ed_claim_recd ed on er.ssn=ed.insssn and substr(er.group_rec_key,1,10) = substr(er.group_rec_key,1,10) 
    and ed.claim in (select claimno from cd_claim_recd cd where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101')
    and ed.insssn in (select er1.ssn from employee_rec er1 where er1.status != 'ACTIV' and trim(ER1.CLAIMNO) is null)
    and er.sysind not in ('ABC,'BCD')

Below is what I could come up with, but the results are not same as the previous query

select * from employee_rec er
  inner join ed_claim_recd ed on er.ssn = ed.insssn and substr(er.group_rec_key, 1, 10) = substr(er.group_rec_key, 1, 10)
  inner join (select claimno from cd_claim_recd cd where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101') cr on ed.claim = cr.claimno
  inner join (
    select insssn from ed_claim_recd ed2
      inner join (
        select ssn from employee_rec er1
        where
          er1.status != 'ACTIV'
          and trim(ER1.CLAIMNO) is null
      ) er2 on ed2.insssn = er2.ssn
  ) ed3 on ed.insssn = ed3.insssn
  and er.sysind not in ('ABC', 'BCD')

Is this the right approach to rewrite the query or am I way too off? Also, is it an efficient way to rewrite the query to replace "IN' with 'INNER JOIN'?

Upvotes: 3

Views: 2946

Answers (4)

Mahesh Mogal
Mahesh Mogal

Reputation: 658

In case of hive, we do need to replace IN clasue with INNER join. you can rewrite above query as follows. I donot think we need to join employee_rec table twice to get result.

select er.*,ed.* from employee_rec er
inner join 
ed_claim_recd ed
on er.ssn = ed.insssn
inner join cd_claim_recd cd
on ed.claim = cd.claimno
where cd.closedt is not null 
    and cd.closedt != '0000000' 
    and cd.closedt >= '2130101'
    and er.status != 'ACTIV' and trim(er.CLAIMNO) is null
    and er.sysind not in ('ABC,'BCD')

As you can this is the compact and efficient way to achieve your results as we are eliminating one join condition. As I am not aware of data I have not tested it. If you have duplicate CLAIMS and SSN then you need to deal with it. And as @leftjoin mentioned you can further improve performance by eliminating redundant conditions on CLOSEDT

Upvotes: 1

leftjoin
leftjoin

Reputation: 38325

IN subquery and INNER JOIN work differently. Join will output for each join key from one table all rows with the same key from joined table. So, Join can duplicate rows if a join key in the joined table is not unique. IN subquery will not duplicate rows.

For example if in your cr join subquery

inner join (select claimno from cd_claim_recd cd where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101') cr on ed.claim = cr.claimno

the claimno is not unique, then joined rows with matching claimno will be duplicated. This is quite normal Join behavior.

To avoid such duplication, make sure that join key is unique by adding DISTINCT, row_number() filter, group by, etc:

inner join (select DISTINCT claimno from cd_claim_recd cd where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101') cr on ed.claim = cr.claimno

And the same for other such joins.

In this case the result of IN and Join should be the same.

BTW you do not need all these conditions together:

where cd.closedt is not null and cd.closedt != '0000000' and cd.closedt >= '2130101'

because '2130101' is greater than '0000000' and if cd.closedt >= '2130101' it cannot be NULL. cd.closedt >= '2130101' is quite enough.

Have found one more possible problem:

and trim(ER1.CLAIMNO) is null

In Hive (and you have tagged your question with @hive tag) empty string and null are two different things.

('' is not NULL) = true in Hive.

I'd suggest to replace it with and (ER.CLAIMNO is null or trim(ER1.CLAIMNO)='') Empty string is a normal value in Hive, that is why empty strings participate in joins. Convert to NULL or filter them before join if you do not need them to be joined.

ed3 subquery contains redundant join, it is not like original IN subquery.

Maybe there are other issues. Test all joins one by one to find all of them

Upvotes: 5

The following eliminates the IN:

select er.*, ed.*
  from employee_rec er 
  inner join ed_claim_recd ed
    on er.ssn=ed.insssn and
       substr(ed.group_rec_key,1,10) = substr(er.group_rec_key,1,10)
  INNER JOIN (select claimno
                from cd_claim_recd cd
                where cd.closedt is not null and
                      cd.closedt != '0000000' and
                      cd.closedt >= '2130101') j1
    ON j1.CLAIMNO = ed.claim
  INNER JOIN (select er1.ssn from employee_rec er1
                where er1.status != 'ACTIV' and
                      trim(ER1.CLAIMNO) is null) and
                      er.sysind not in ('ABC,'BCD')) j2
    ON j2.SSN = ed.insssn

As @GordonLinoff rightly points out, though, this may not affect performance at all.

Best of luck.

Upvotes: 1

William Robertson
William Robertson

Reputation: 16001

I make it this (untested):

select er.*, ed.*
from   employee_rec er 
       join ed_claim_recd ed
            on  ed.insssn = er.ssn
            and substr(ed.group_rec_key,1,10) /* was er.group_rec_key */ = substr(er.group_rec_key,1,10)
       join cd_claim_recd cd
            on  cd.claimno = ed.claim
       join employee_rec er1
            on  er1.ssn = ed.insssn
where  er.sysind not in ('ABC', 'BCD')
-- and    cd.closedt is not null   -- redundant
-- and    cd.closedt != '0000000'  -- redundant
and    cd.closedt >= '2130101'
and    er1.status != 'ACTIV'
and    trim(er1.claimno) is null

If cd_claim_recd.claimno and employee_rec.ssn and not unique keys then you may need some dedupe logic.

Upvotes: 1

Related Questions