left outer join on table with conditional select of record Oracle SQL/PL/SQL

Question

I have a scenario where I need to join from my starting table, account_tbl to a stu_ssn_tbl. The columns of my join are stu_no and stu_seq_no.

Here's the structure of stu_ssn_tbl

CREATE TABLE stu_ssn_tbl 
(
curr_ssn_ind    character(1),
no  character(9),
pseudo_ssn_ind  character(1),
src character(8),
stu_dob date,
stu_no  integer ,
stu_seq_no  integer,
sys_cr_dt   date,
do_not_use character(1)
);

Here's my sample data:

Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('N','100      ','R','234600  ',to_timestamp('29-DEC-50','DD-MON-RR HH.MI.SSXFF AM'),1,7,to_timestamp('14-AUG-12','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('Y','101','R','234600  ',to_timestamp('29-DEC-50','DD-MON-RR HH.MI.SSXFF AM'),1,7,to_timestamp('14-AUG-12','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('Y','102     ','R','713001  ',to_timestamp('11-JAN-00','DD-MON-RR HH.MI.SSXFF AM'),1,6,to_timestamp('14-APR-03','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('N','203      ','R','368700  ',to_timestamp('05-DEC-53','DD-MON-RR HH.MI.SSXFF AM'),2,3,to_timestamp('01-JUL-98','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('Y','104 ','R','368700  ',to_timestamp('05-DEC-53','DD-MON-RR HH.MI.SSXFF AM'),2,3,to_timestamp('01-JUL-98','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('N','105     ','P','341100  ',to_timestamp('20-JUL-43','DD-MON-RR HH.MI.SSXFF AM'),46,3,to_timestamp('11-FEB-13','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('Y','106','P','341100  ',to_timestamp('20-JUL-43','DD-MON-RR HH.MI.SSXFF AM'),46,3,to_timestamp('11-FEB-13','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('N','107      ','R','184300  ',to_timestamp('27-JAN-55','DD-MON-RR HH.MI.SSXFF AM'),3,2,to_timestamp('22-MAY-01','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('Y','108','R','184300  ',to_timestamp('27-JAN-55','DD-MON-RR HH.MI.SSXFF AM'),3,2,to_timestamp('22-MAY-01','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('N','109      ','R','368700  ',to_timestamp('27-NOV-61','DD-MON-RR HH.MI.SSXFF AM'),1,3,to_timestamp('01-JUL-98','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('Y','601 ','R','368700  ',to_timestamp('27-NOV-61','DD-MON-RR HH.MI.SSXFF AM'),1,3,to_timestamp('01-JUL-98','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('N','900      ','R','293900  ',to_timestamp('20-JUN-71','DD-MON-RR HH.MI.SSXFF AM'),1,5,to_timestamp('23-APR-02','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('Y','2449','R','293900  ',to_timestamp('20-JUN-71','DD-MON-RR HH.MI.SSXFF AM'),1,5,to_timestamp('09-APR-03','DD-MON-RR HH.MI.SSXFF AM'),null);
Insert into stu_ssn_tbl (CURR_SSN_IND,NO,PSEUDO_SSN_IND,SRC,STU_DOB,STU_NO,STU_SEQ_NO,SYS_CR_DT,DO_NOT_USE) values ('N','95','P','293900  ',to_timestamp('20-JUN-71','DD-MON-RR HH.MI.SSXFF AM'),1,5,to_timestamp('23-APR-02','DD-MON-RR HH.MI.SSXFF AM'),null);

For the sake of simplicity, we'll say this is the structure of the account table.

CREATE TABLE account_tbl
(
id integer,
stu_no integer,
stu_seq_no integer
);

So I'm going from account to stu_ssn_tbl on stu_no and stu_seq_no

SELECT *
FROM ACCOUNT_TBL ACCT
LEFT OUTER JOIN STU_SSN_TBL SSN
ON ACCT.STU_NO = SSN.STU_NO
AND ACCT.STU_SEQ_NO = SSN.STU_SEQ_NO;

The problem is that there is no enforced uniqueness on any of these tables (these tables were data dumps from an old database that nobody knows anything about). I need a way to select the correct record from stu_ssn_tbl based on the following rules.

SYS_CR_DT date closest to the ROLLUPDATE( a date that will be provided, without going over it). For the sake of the test case we can say '2012-12-12'
If two or more SYS_CR_DT dates exist that are the same, choose the record with a CURR_SSN_IND = 'Y'.
If a record cannot be selected based on CURR_SSN_IND = 'Y' ( in the case that they are both 'Y' or 'N') select the record where PSEUDO_SSN_IND = 'R'
If a match still cannot be selected, choose record at random.

I do not have a requirement as far as using PL/SQL or regular SQL to accomplish this, just need to keep in mind that I'm dealing with over 300 million records in the account_tbl and close to 100 million records in the stu_ssn_tbl.

I have two trains of thoughts on this. The preferable way to this would be during the join between account_tbl and stu_ssn_tbl. However, if this is not possible, I've added the do_not_use column, which is can use to functionally delete records that I do not want in pre processing.

So far I have this - which helps me settle bullet 1, and I decided to order an order by desc on CURR_SSN_IND and PSEUDO_SSN_IND because it would give me the record I would watch at the top. I tried putting a fetch first 1 row on it, but that brought back 1 record for the entire table.

SELECT 
SRC.*
FROM STU_SSN_TBL SRC
LEFT OUTER JOIN STU_SSN_TBL CHK
ON SRC.STU_NO = CHK.STU_NO AND SRC.STU_SEQ_NO = CHK.STU_SEQ_NO
AND SRC.SYS_CR_DT < CHK.SYS_CR_DT
AND SRC.SYS_CR_DT < TO_DATE('2012-12-12', 'yyyy-mm-dd')
WHERE CHK.SYS_CR_DT IS NULL
ORDER BY SRC.STU_NO ASC, SRC.STU_SEQ_NO ASC, SRC.CURR_SSN_IND DESC , SRC.PSEUDO_SSN_IND DESC

Ben · Accepted Answer

This is a classical greatest-n-per-group problem, but on 300m records. The canonical question on this topic has a number of options, all of which should, logically, work. However, most require performing additional table- or index-scans or performing a lot of sorting. So, I'd highly recommend going with either the FIRST or LAST aggregate functions (aka the most difficult thing to search for in Oracle's documentation). You may know these as the KEEP clause.

FIRST and LAST are Oracle-specific extensions to the SQL standard and their usefulness with large datasets comes from the fact that a SORT GROUP BY is immediately performed on the data-set, which can massively reduce the amount of records considered for the remainder of your query. I'd highly recommend reading Rob van Wijk's blog post for a fuller explanation.

You're correct that you need to provide a sort order, but you want to sort within the context of each group; to go through your logic:

SYS_CR_DT date closest to the ROLLUPDATE( a date that will be provided, without going over it). For the sake of the test case we can say '2012-12-12'

Assuming the maximum date is included in the ORDER BY clause you want to order by SYS_CR_DATE desc.

If two or more SYS_CR_DT dates exist that are the same, choose the record with a CURR_SSN_IND = 'Y'.

Then, by a statement that priorities a specific value. I'd normally implement this with a CASE statement: case when CURR_SSN_IND = 'Y' then 0 else 1 end asc (ascending is the default sort order but when doing complicated things it's clearer if you're explicit).

If a record cannot be selected based on CURR_SSN_IND = 'Y' ( in the case that they are both 'Y' or 'N') select the record where PSEUDO_SSN_IND = 'R'

This is the same logic as above, save with a different column and value; giving you: case when PSEUDO_SSN_IND = 'R' then 0 else 1 end asc

If a match still cannot be selected, choose record at random.

This is where you might run into problems. Oracle will pick a random record for you, there's no need to explicitly state randomness. However, because FIRST and LAST will sort randomly per column you need to explicitly provide an ORDER BY that removes randomness. I'd use the [ROWIDs]5 of the columns in the tables. This is not guaranteed to be consistent over time but will remain so for the duration of a transaction.

Looking at your last query I assume that this hasn't been thought through completely as you're not selecting from both your tables and I'm a little confused about your logic with the dates. I'm also a little unsure about the LEFT OUTER JOIN. I'm assuming here that you want a table unique on STU_NO and STU_SEQ_NO that contains all of those present in ACCOUNT_TBL and it doesn't matter whether they're present in STU_SSN_TBL.

So, putting the logic together this is what you'd end up with:

select a.stu_no
     , a.stu_seq_no
     , max(s.curr_ssn_id) keep (
         dense_rank first order by sys_cr_date desc
                                 , case when curr_ssn_ind = 'Y' then 0 else 1 end asc                     
                                 , case when pseudo_ssn_ind = 'R' then 0 else 1 end asc
                                 , s.rowid asc
                                 , a.rowid asc
                                   ) as curr_ssn_id
     , max(s.no) keep (
         dense_rank first order by sys_cr_date desc
                                 , case when curr_ssn_ind = 'Y' then 0 else 1 end asc                     
                                 , case when pseudo_ssn_ind = 'R' then 0 else 1 end asc
                                 , s.rowid asc
                                 , a.rowid asc
                                   ) as no
     , -- ...
     , -- keep repeating for every column
     , -- ...
     , max(s.sys_cr_dt) keep (
         dense_rank first order by sys_cr_date desc
                                 , case when curr_ssn_ind = 'Y' then 0 else 1 end asc                     
                                 , case when pseudo_ssn_ind = 'R' then 0 else 1 end asc
                                 , s.rowid asc
                                 , a.rowid asc
                                   ) as sys_cr_date
     , -- keep repeating for every column
  from account_tbl a
  left outer join stu_ssn_tbl s
    on a.stu_no = s.stu_no
   and a.stu_seq_no = s.stu_seq_no
 where s.sys_cr_date < date '2012-12-12'
 group by a.stu_no
        , a.stu_seq_no

As you can see this is pretty... ugly. But, it should be the most efficient way of performing it.

Your next problem is the 300m rows. On any semi-decent box/disks I wouldn't expect any problems with this, even single-threaded. But, if you have Enterprise edition it'd be worth using the parallel hint and if you don't DBMS_PARALLEL_EXECUTE is more complicated but could be used to chunk ACCOUNT_TBL so that you can parallelise. Ensure that you do this via SQL so that you don't have STU_NO/STU_SEQ_NO combinations in multiple chunks.

You mention that you have a lot of disk space, which is good but if you can avoid the disks you're in a better position. You'll have to do to table accesses here, and one hash join, Oracle will be writing all of this to your temporary tablespace, and will then perform some sort operations. You want to avoid performing these joins and sorts more times than you need to.

Your next concern should be whether you need to reproduce this. If you do you have a 2 options:

If you need to keep all the data available then use your idea of a DO_NOT_USE flag. This is, I hope, a contradiction and you don't need to do this.
Create a new set of tables that has the information you need and archive off the rest - there's no point keeping huge amounts of completely unused data "hot" for use.

If you're going to be performing this sort of operation frequently look at sizing the [buffer cache][9] to be able to put the entire lot (and your normal operations) in RAM.

left outer join on table with conditional select of record Oracle SQL/PL/SQL

Answers (1)

Related Questions