Reputation: 399
I have a procedure used to scramble transactional data when moving live data to a test environment. The table in question holds approx. 100 million rows spread across 50 partitions, and a new partition is added every month. As the volume increases, the procedure executes more slowly than before.
I am looking into introducing some degree of parallelization into my code. This is new territory for me, and I am wondering if there are any best practices. Perhaps use dbms_parallel_execute to split the update into chunks?
Any recommendations on how to optimize my code are very much appreciated!
PROCEDURE Scramble_Transactions
AS
   vSeed   BINARY_INTEGER;

   CURSOR Transactions_cur
   IS
      SELECT T.ID,
             T.MONTH_PARTITION,
             T.TRACE_NUM,
             T.TXTDATA
        FROM TRANSACTIONS T;

   TYPE TBL IS TABLE OF Transactions_cur%ROWTYPE
      INDEX BY PLS_INTEGER;

   Transactions_Rec   TBL;
   vCounter           NUMBER (10);
   vString            VARCHAR2 (300);
   vLen               NUMBER (5);
   vFromRange         VARCHAR2 (25);
   vToRange           VARCHAR2 (25);
BEGIN
   vCounter := 0;

   SELECT SUBSTR (TO_CHAR (SYSDATE, 'ddmmyyyyhhmiss'), 11)
     INTO vSeed
     FROM DUAL;

   DBMS_RANDOM.initialize (vSeed);
   DBMS_RANDOM.SEED (vSeed);
   vFromRange := 0;

   OPEN Transactions_cur;

   LOOP
      FETCH Transactions_cur BULK COLLECT INTO Transactions_Rec LIMIT 10000;

      FOR i IN 1 .. Transactions_Rec.COUNT
      LOOP
         IF Transactions_Rec (i).TRACE_NUM IS NOT NULL
         THEN
            vString := Transactions_Rec (i).TRACE_NUM;
            vLen := LENGTH (TRIM (vString));
            vToRange := POWER (10, vLen) - 1;
            Transactions_Rec (i).TRACE_NUM :=
               LPAD (TRUNC (DBMS_RANDOM.VALUE (vFromRange, vToRange)), 6, '1');
         END IF;

         IF Transactions_Rec (i).TXTDATA IS NOT NULL
         THEN
            vString := Transactions_Rec (i).TXTDATA;
            vLen := LENGTH (TRIM (vString));
            vToRange := POWER (10, vLen) - 1;
            Transactions_Rec (i).TXTDATA :=
               LPAD (TRUNC (DBMS_RANDOM.VALUE (vFromRange, vToRange)), 12, '3');
         END IF;

         vCounter := vCounter + 1;
      END LOOP;

      FORALL rec IN 1 .. Transactions_Rec.COUNT
         UPDATE Transactions
            SET TRACE_NUM = Transactions_Rec (rec).TRACE_NUM,
                TXTDATA = Transactions_Rec (rec).TXTDATA
          WHERE ID = Transactions_Rec (rec).ID
            AND MONTH_PARTITION = Transactions_Rec (rec).MONTH_PARTITION;

      EXIT WHEN Transactions_cur%NOTFOUND;
   END LOOP;

   DBMS_RANDOM.TERMINATE;
   CLOSE Transactions_cur;
   COMMIT;
END Scramble_Transactions;
Edit, my solution based on the feedback below: I rewrote part of the procedure so that the data scrambling is done in SQL instead of PL/SQL. The procedure now also takes partition from/to as parameters, allowing parallel processing.
CREATE OR REPLACE PROCEDURE Scramble_Transactions (
   P_MONTH_PARTITION_FROM VARCHAR2,
   P_MONTH_PARTITION_TO   VARCHAR2)
AS
   CURSOR Transactions_cur (
      V_MONTH_PARTITION_FROM TRANSACTIONS.MONTH_PARTITION%TYPE,
      V_MONTH_PARTITION_TO   TRANSACTIONS.MONTH_PARTITION%TYPE)
   IS
      SELECT T.ID,
             T.MONTH_PARTITION,
             REGEXP_REPLACE (T.TRACE_NUM, '[0-9]', '9') TRACE_NUM,
             REGEXP_REPLACE (T.TXTDATA, '[0-9]', '9') TXTDATA
        FROM TRANSACTIONS T
       WHERE T.MONTH_PARTITION BETWEEN V_MONTH_PARTITION_FROM
                                   AND V_MONTH_PARTITION_TO;

   TYPE TBL IS TABLE OF Transactions_cur%ROWTYPE
      INDEX BY PLS_INTEGER;

   Transactions_Rec   TBL;
BEGIN
   OPEN Transactions_cur (P_MONTH_PARTITION_FROM, P_MONTH_PARTITION_TO);

   LOOP
      FETCH Transactions_cur BULK COLLECT INTO Transactions_Rec LIMIT 10000;

      /* Some additional processing */

      FORALL rec IN 1 .. Transactions_Rec.COUNT
         UPDATE Transactions
            SET TRACE_NUM = Transactions_Rec (rec).TRACE_NUM,
                TXTDATA = Transactions_Rec (rec).TXTDATA
          WHERE ID = Transactions_Rec (rec).ID
            AND MONTH_PARTITION = Transactions_Rec (rec).MONTH_PARTITION;

      EXIT WHEN Transactions_cur%NOTFOUND;
   END LOOP;

   CLOSE Transactions_cur;
   COMMIT;
END;
/
/
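For comparison, if the additional processing can also be pushed into SQL, the whole per-range pass collapses into a single set-based UPDATE and the bulk-collect loop disappears entirely. A sketch under that assumption (Scramble_Transactions_Sql is a hypothetical name; columns and the REGEXP_REPLACE scrambling are taken from the procedure above):

CREATE OR REPLACE PROCEDURE Scramble_Transactions_Sql (
   P_MONTH_PARTITION_FROM VARCHAR2,
   P_MONTH_PARTITION_TO   VARCHAR2)
AS
BEGIN
   -- One set-based statement; the range predicate lets the optimizer
   -- prune to the partitions being scrambled.
   UPDATE Transactions T
      SET T.TRACE_NUM = REGEXP_REPLACE (T.TRACE_NUM, '[0-9]', '9'),
          T.TXTDATA   = REGEXP_REPLACE (T.TXTDATA, '[0-9]', '9')
    WHERE T.MONTH_PARTITION BETWEEN P_MONTH_PARTITION_FROM
                                AND P_MONTH_PARTITION_TO
      AND (T.TRACE_NUM IS NOT NULL OR T.TXTDATA IS NOT NULL);

   COMMIT;
END Scramble_Transactions_Sql;
/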
Now execute the procedure in parallel via DBMS_PARALLEL_EXECUTE. The query is split into chunks based on the partition key.
DECLARE
   L_TASK_SQL    CLOB;
   V_TASKNAME    USER_PARALLEL_EXECUTE_TASKS.TASK_NAME%TYPE;
   V_STATUS      USER_PARALLEL_EXECUTE_TASKS.STATUS%TYPE;
   C_TASK_NAME   CONSTANT VARCHAR2 (50) := 'TRANSACTIONS_TASK';
BEGIN
   L_TASK_SQL :=
      'SELECT PARTITION_NAME, PARTITION_NAME FROM USER_TAB_PARTITIONS WHERE TABLE_NAME = ''TRANSACTIONS''';

   DBMS_PARALLEL_EXECUTE.CREATE_TASK (C_TASK_NAME);

   DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_SQL (
      TASK_NAME => C_TASK_NAME,
      SQL_STMT  => L_TASK_SQL,
      BY_ROWID  => FALSE);

   DBMS_PARALLEL_EXECUTE.RUN_TASK (
      TASK_NAME      => C_TASK_NAME,
      SQL_STMT       => 'BEGIN SCRAMBLE_TRANSACTIONS( :START_ID, :END_ID ); END;',
      LANGUAGE_FLAG  => DBMS_SQL.NATIVE,
      PARALLEL_LEVEL => 6);

   SELECT TASK_NAME, STATUS
     INTO V_TASKNAME, V_STATUS
     FROM USER_PARALLEL_EXECUTE_TASKS
    WHERE TASK_NAME = C_TASK_NAME;

   DBMS_OUTPUT.PUT_LINE ('TASK:' || V_TASKNAME || ' , STATUS:' || V_STATUS);

   DBMS_PARALLEL_EXECUTE.DROP_CHUNKS (TASK_NAME => C_TASK_NAME);
   DBMS_PARALLEL_EXECUTE.DROP_TASK (TASK_NAME => C_TASK_NAME);
END;
/
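One caveat with this pattern: RUN_TASK returns even when some chunks fail, and dropping the task immediately discards the error details recorded per chunk. A sketch of checking the task status and retrying the failed chunks before cleanup (same task name as above):

DECLARE
   V_STATUS   NUMBER;
BEGIN
   V_STATUS := DBMS_PARALLEL_EXECUTE.TASK_STATUS ('TRANSACTIONS_TASK');

   -- RESUME_TASK re-runs only the chunks that failed;
   -- chunks already processed successfully are skipped.
   IF V_STATUS = DBMS_PARALLEL_EXECUTE.FINISHED_WITH_ERROR
   THEN
      DBMS_PARALLEL_EXECUTE.RESUME_TASK ('TRANSACTIONS_TASK');
   END IF;
END;
/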
Overall execution time is down to about 30 minutes, compared to 13-14 hours previously.
Upvotes: 0
Views: 1186
Reputation: 4424
I think you would be much better off in terms of performance using a CTAS (CREATE TABLE ... AS SELECT), or INSERT /*+ APPEND */ ..., rather than an UPDATE. Since your data is partitioned, you can employ partition exchange. This would allow you to use parallelism much more effectively, together with direct-path load operations.
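A minimal sketch of that approach for one partition, assuming ID, MONTH_PARTITION, TRACE_NUM and TXTDATA are the only columns (exchange requires the staging table to match the partitioned table's structure exactly; the partition name P202301 is illustrative):

-- Build the scrambled copy of one partition with a direct-path parallel load.
CREATE TABLE TRANSACTIONS_STAGE
PARALLEL 6 NOLOGGING
AS
SELECT T.ID,
       T.MONTH_PARTITION,
       REGEXP_REPLACE (T.TRACE_NUM, '[0-9]', '9') TRACE_NUM,
       REGEXP_REPLACE (T.TXTDATA, '[0-9]', '9') TXTDATA
  FROM TRANSACTIONS PARTITION (P202301) T;

-- Swap the scrambled copy in: a data-dictionary operation, not a data copy.
ALTER TABLE TRANSACTIONS
   EXCHANGE PARTITION P202301 WITH TABLE TRANSACTIONS_STAGE;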
Upvotes: 0
Reputation: 11591
SQL is a good option, but perhaps one very quick fix: you are updating the same table you are fetching from. That can create huge undo issues, because the fetch must return a result set consistent to a point in time. So each time around the fetch loop, you might be doing more and more work (undoing the updates you have just done). Of course, committing each loop then creates the issue of restartability on error. So maybe do it a partition at a time, without looping, e.g.
PROCEDURE Scramble_Transactions (p_parname VARCHAR2)
AS
   vSeed              BINARY_INTEGER;
   Transactions_cur   SYS_REFCURSOR;

   CURSOR Transactions_cur_template
   IS
      SELECT T.ID,
             T.MONTH_PARTITION,
             T.TRACE_NUM,
             T.TXTDATA
        FROM TRANSACTIONS T;

   TYPE TBL IS TABLE OF Transactions_cur_template%ROWTYPE
      INDEX BY PLS_INTEGER;

   Transactions_Rec   TBL;
   vCounter           NUMBER (10);
   vString            VARCHAR2 (300);
   vLen               NUMBER (5);
   vFromRange         VARCHAR2 (25);
   vToRange           VARCHAR2 (25);
BEGIN
   vCounter := 0;

   SELECT SUBSTR (TO_CHAR (SYSDATE, 'ddmmyyyyhhmiss'), 11)
     INTO vSeed
     FROM DUAL;

   DBMS_RANDOM.initialize (vSeed);
   DBMS_RANDOM.SEED (vSeed);
   vFromRange := 0;

   -- Note: the PARTITION clause must come before the table alias.
   OPEN Transactions_cur FOR
      'SELECT T.ID,
              T.MONTH_PARTITION,
              T.TRACE_NUM,
              T.TXTDATA
         FROM TRANSACTIONS PARTITION (' || p_parname || ') T
        WHERE TRACE_NUM IS NOT NULL OR TXTDATA IS NOT NULL';

   FETCH Transactions_cur BULK COLLECT INTO Transactions_Rec;

   FOR i IN 1 .. Transactions_Rec.COUNT
   LOOP
      IF Transactions_Rec (i).TRACE_NUM IS NOT NULL
      THEN
         vString := Transactions_Rec (i).TRACE_NUM;
         vLen := LENGTH (TRIM (vString));
         vToRange := POWER (10, vLen) - 1;
         Transactions_Rec (i).TRACE_NUM :=
            LPAD (TRUNC (DBMS_RANDOM.VALUE (vFromRange, vToRange)), 6, '1');
      END IF;

      IF Transactions_Rec (i).TXTDATA IS NOT NULL
      THEN
         vString := Transactions_Rec (i).TXTDATA;
         vLen := LENGTH (TRIM (vString));
         vToRange := POWER (10, vLen) - 1;
         Transactions_Rec (i).TXTDATA :=
            LPAD (TRUNC (DBMS_RANDOM.VALUE (vFromRange, vToRange)), 12, '3');
      END IF;

      vCounter := vCounter + 1;
   END LOOP;

   FORALL rec IN 1 .. Transactions_Rec.COUNT
      UPDATE Transactions
         SET TRACE_NUM = Transactions_Rec (rec).TRACE_NUM,
             TXTDATA = Transactions_Rec (rec).TXTDATA
       WHERE ID = Transactions_Rec (rec).ID
         AND MONTH_PARTITION = Transactions_Rec (rec).MONTH_PARTITION;

   DBMS_RANDOM.TERMINATE;
   CLOSE Transactions_cur;
   COMMIT;
END Scramble_Transactions;
So with just a few lines of code changes, we've confined each run to a single partition and removed the fetch-loop-over-updates pattern that was generating all that undo work.
You could then submit a job (using say DBMS_SCHEDULER) for each partition name, and because we are now isolating per partition, we won't get contention across the jobs.
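That submission could be sketched like this, one job per partition (the SCRAMBLE_ job-name prefix is illustrative; Scramble_Transactions is the single-parameter version above):

BEGIN
   FOR p IN (SELECT partition_name
               FROM user_tab_partitions
              WHERE table_name = 'TRANSACTIONS')
   LOOP
      -- Each job runs one partition; jobs execute concurrently
      -- up to the scheduler's resource limits.
      DBMS_SCHEDULER.CREATE_JOB (
         job_name   => 'SCRAMBLE_' || p.partition_name,
         job_type   => 'PLSQL_BLOCK',
         job_action => 'BEGIN Scramble_Transactions (''' || p.partition_name || '''); END;',
         enabled    => TRUE,
         auto_drop  => TRUE);
   END LOOP;
END;
/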
Don't get me wrong - a full refactoring in SQL is perhaps still the best option, but in terms of quick wins, the code above might solve your issue with minimal changes.
Upvotes: 2