dscl

Reputation: 1626

ORACLE/SQL - Need help optimizing 'merge' style script

We have a 'merge' script that is used to assign codes to customers. Currently it works by looking at customers in a staging table and assigning them unused codes. Those codes are marked as used, and the staged records, now with codes, are loaded into a production table. The staging table gets cleared and life is peachy.

Unfortunately we are working with a larger data set now (both customers and codes) and the process is taking WAY too long to run. I'm hoping the wonderful community here can look at the code and offer either improvements to it or another way of attacking the problem.

Thanks in advance!

Edit - Forgot to mention part of the reason for some of the checks in this is that the staging table is 'living' and can have records feeding into it during the script run.

whenever sqlerror exit 1

-- stagingTable: TAB_000000003134
-- codeTable: TAB_000000003135
-- masterTable: TAB_000000003133

-- dedupe staging table
delete from TAB_000000003134 a
where ROWID > (
  select min(rowid)
  from TAB_000000003134 b
  where a.cust_id = b.cust_id
  );
commit;

delete from TAB_000000003134
where cust_id is null;
commit;


-- set row num on staging table
update TAB_000000003134
set row_num = rownum;
commit;

-- reset row nums on code table
update TAB_000000003135
set row_num = NULL;
commit;

-- assign row nums to codes
update TAB_000000003135
set row_num = rownum
where dateassigned is null
and active = 1;
commit;

-- attach codes to staging table
update TAB_000000003134 d
set (CODE1, CODE2) =
(
  select CODE1, CODE2
  from TAB_000000003135 c
  where d.row_num = c.row_num
);
commit;

-- mark used codes compared to template
update TAB_000000003135 c
set dateassigned = sysdate,
    assignedto = (select cust_id from TAB_000000003134 d where c.CODE1 = d.CODE1)
where exists (select 'x' from TAB_000000003134 d where c.CODE1 = d.CODE1);
commit;

-- clear and copy data to master
truncate table TAB_000000003133;
insert into TAB_000000003133 (
        <customer fields>, code1, code2, TIMESTAMP_
        )
select <customer fields>, CODE1, CODE2, SYSDATE
from TAB_000000003134;
commit;

-- remove any staging records with code numbers
delete from TAB_000000003134
where CODE1 is not NULL;
commit;

quit

Upvotes: 3

Views: 1042

Answers (2)

Cheran Shunmugavel

Reputation: 8459

  • Don't commit after every statement. Instead, you should issue one COMMIT at the end of the script. This isn't so much for performance, but because the data is not in a consistent state until the end of the script.

(It turns out there probably are performance benefits to committing less frequently in Oracle, but your primary concern should be maintaining consistency.)

  • You might look into using global temporary tables. The data in a global temp table is only visible to the current session, so you could skip some of the reset steps in your script.
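For illustration, a session-private staging table might be declared something like this. This is only a sketch: the table and column names here are placeholders, not the asker's actual schema, and the column types are guesses.

```sql
-- Rows in a global temporary table are visible only to the inserting
-- session; ON COMMIT PRESERVE ROWS keeps them across commits until
-- the session ends (the default, ON COMMIT DELETE ROWS, clears them
-- at every commit, which would defeat the multi-step script).
create global temporary table staging_customers (
    cust_id   number,
    code1     varchar2(30),
    code2     varchar2(30),
    row_num   number
) on commit preserve rows;
```

Because each session sees only its own rows, the "reset row nums" and cleanup steps become unnecessary; the data simply disappears when the session ends.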

Upvotes: 0

Jon Heller

Reputation: 36902

  • Combine statements as much as possible. For example, combine the first two deletes by simply adding "or cust_id is null" to the first delete. This will definitely reduce the number of reads, and may also significantly decrease the amount of data written. (Oracle writes blocks, not rows, so even if the two statements work with different rows they may be re-writing the same blocks.)
  • It's probably quicker to insert the entire table into another table than to update every row. Oracle does a lot of extra work for updates and deletes, to maintain concurrency and consistency. And updating values to NULL can be especially expensive; see "update x set y = null takes a long time" for some more details. You can avoid (almost all) UNDO and REDO with direct-path inserts: make sure the table is in NOLOGGING mode (or the database is in NOARCHIVELOG mode), and insert using the APPEND hint.
  • Replace the UPDATEs with MERGEs. UPDATEs can only use nested loops; MERGEs can also use hash joins. If you're updating a large amount of data a MERGE can be significantly faster. And MERGEs don't have to read a table twice if it's used for the SET and for an EXISTS. (Although creating a new table may also be faster.)
  • Use /*+ APPEND */ with the TAB_000000003133 insert. If you're truncating the table, I assume you don't need point-in-time recovery of the data, so you might as well insert it directly to the datafile and skip all the overhead.
  • Use parallelism (if you're not already). There are side effects and dozens of factors to consider for tuning, but don't let that discourage you. If you're dealing with large amounts of data, sooner or later you'll need parallelism to get the most out of your hardware.
  • Use better names. This advice is more subjective, but in my opinion using good names is extremely important. Even though it's all 0s and 1s at some level, and many programmers think that cryptic code is cool, you want people to understand and care about your data. People just won't care as much about TAB_000000003135 as they would about something like TAB_CUSTOMER_CODES. It'll be harder to learn, people will be less likely to change it because it looks so complicated, and less likely to see errors because the purpose isn't as clear.
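Putting a few of these ideas together, the middle of the script might look something like the sketch below. This is untested and only illustrative: the table names follow the question, `<customer fields>` is the asker's placeholder, and the APPEND/PARALLEL hints assume the tables and database are configured for direct-path and parallel DML.

```sql
-- One pass over the staging table: drop NULL cust_ids and duplicates together
delete from TAB_000000003134 a
where cust_id is null
   or rowid > (select min(rowid)
               from TAB_000000003134 b
               where a.cust_id = b.cust_id);

-- Attach codes with a MERGE instead of a correlated UPDATE,
-- so the optimizer is free to use a hash join
merge into TAB_000000003134 d
using (
    select code1, code2, row_num
    from TAB_000000003135
    where dateassigned is null
      and active = 1
) c
on (d.row_num = c.row_num)
when matched then update
    set d.code1 = c.code1,
        d.code2 = c.code2;

-- Direct-path, parallel insert into the (truncated) master table
insert /*+ append parallel(m) */ into TAB_000000003133 m
        (<customer fields>, code1, code2, TIMESTAMP_)
select /*+ parallel(s) */ <customer fields>, code1, code2, sysdate
from TAB_000000003134 s;

commit;
```

Note the single COMMIT at the end, per the other answer: the intermediate states aren't consistent, so there's no reason to commit between steps.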

Upvotes: 1
