sewey

Reputation: 61

Massive Delete statement - How to improve query execution time?

I have a Spring batch job that runs every day to:

  1. Read CSV files and import them into our database

  2. Aggregate this data and save the aggregated data into another table.

We have a table BATCH_LIST that contains information about all the batches that were already executed.

BATCH_LIST has the following columns:

 1. BATCH_ID
 2. EXECUTION_DATE
 3. STATUS

Among the CSV files that are imported, we have one CSV file to feed an APP_USERS table, and another one to feed the ACCOUNTS table.

APP_USERS has the following columns:

 1. USER_ID
 2. BATCH_ID
 -- more columns

ACCOUNTS has the following columns:

 1. ACCOUNT_ID
 2. BATCH_ID
 -- more columns

In step 2, we aggregate data from ACCOUNTS and APP_USERS to insert rows into a USER_ACCOUNT_RELATION table. This table has exactly two columns: ACCOUNT_ID (referring to ACCOUNTS.ACCOUNT_ID) and USER_ID (referring to APP_USERS.USER_ID).

Now we want to add another step to our Spring batch. We want to delete all the data from the USER_ACCOUNT_RELATION table, but also from APP_USERS and ACCOUNTS, that is no longer relevant (i.e. data that was imported before sysdate - 2).

What has been done so far:

  1. Get all the BATCH_ID that we want to remove from the database

    SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2
    
  2. For each BATCH_ID, we call methods like the following:

     public void deleteAppUsersByBatchId(Connection connection, long batchId) throws SQLException {
         // executes the prepared statements to delete the user/account relations and users for this batch
     }
    
    

And here are the two prepared statements:

DELETE FROM USER_ACCOUNT_RELATION 
WHERE USER_ID IN (
   SELECT USER_ID FROM APP_USERS WHERE BATCH_ID = ?
);
DELETE FROM APP_USERS WHERE BATCH_ID = ?

My issue is that it takes too long to delete data for one BATCH_ID (more than 1 hour).

Note: I only mentioned the APP_USERS, ACCOUNTS and USER_ACCOUNT_RELATION tables, but I actually have around 25 tables to delete from.

How can I improve the query time? (I've just tried to change the WHERE USER_ID IN (...) into an EXISTS; it is better but still way too long.)
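
For reference, the EXISTS variant I tried looks roughly like this (a sketch, run per BATCH_ID):

DELETE FROM USER_ACCOUNT_RELATION r
WHERE EXISTS (
   SELECT 1
   FROM APP_USERS u
   WHERE u.USER_ID = r.USER_ID
     AND u.BATCH_ID = ?
);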

Upvotes: 0

Views: 181

Answers (1)

Sayan Malakshinov

Reputation: 8655

If this will be your regular process, i.e. you only want to keep the last 2 days, you don't need indexes for these deletes, since each run you will delete roughly a third of all rows.

It's better to use just 3 set-based deletes, one per table, instead of separate deletes for each BATCH_ID and table:

DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN
(
  SELECT u.ID
  FROM {USER} u
  JOIN {FILE} f
       ON u.FILE_ID = f.FILE_ID
  WHERE trunc(f.IMPORT_DATE) < (sysdate - 2)
);

DELETE FROM {USER}
  WHERE FILE_ID in (select FILE_ID from {file} where trunc(IMPORT_DATE) < (sysdate - 2));
  
DELETE FROM {ACCOUNT}
  WHERE FILE_ID in (select FILE_ID from {file} where trunc(IMPORT_DATE) < (sysdate - 2));

Just replace {USER}, {FILE}, {ACCOUNT} (and the generic column names ID, FILE_ID, IMPORT_DATE) with your real table and column names.
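
For example, mapped onto the tables from your question (a sketch assuming {USER} = APP_USERS, {FILE} = BATCH_LIST with EXECUTION_DATE, {ACCOUNT} = ACCOUNTS), it would look something like:

DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN
(
  SELECT u.USER_ID
  FROM APP_USERS u
  JOIN BATCH_LIST b
       ON u.BATCH_ID = b.BATCH_ID
  WHERE trunc(b.EXECUTION_DATE) < (sysdate - 2)
);

DELETE FROM APP_USERS
  WHERE BATCH_ID IN (SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < (sysdate - 2));

DELETE FROM ACCOUNTS
  WHERE BATCH_ID IN (SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < (sysdate - 2));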

Obviously, with the Partitioning option it would be much easier: with daily interval partitioning you could simply drop old partitions.
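
A rough sketch of what that could look like for ACCOUNTS (assuming the Partitioning option is licensed and you keep a date column such as IMPORT_DATE to partition on; the column name and dates below are placeholders):

CREATE TABLE ACCOUNTS (
  ACCOUNT_ID   NUMBER,
  BATCH_ID     NUMBER,
  IMPORT_DATE  DATE
  -- more columns
)
PARTITION BY RANGE (IMPORT_DATE)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(PARTITION p_first VALUES LESS THAN (DATE '2020-01-01'));

-- dropping a whole day becomes a metadata operation instead of a huge DELETE:
ALTER TABLE ACCOUNTS DROP PARTITION FOR (TO_DATE('2020-01-03', 'YYYY-MM-DD')) UPDATE GLOBAL INDEXES;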

But even in your case there is another, more involved but really fast, solution: "partition views". For example, for ACCOUNT you can create 3 different tables ACCOUNT_1, ACCOUNT_2 and ACCOUNT_3, then create a partition view:

create view ACCOUNT as
select 1 table_id, a1.* from ACCOUNT_1 a1
union all
select 2 table_id, a2.* from ACCOUNT_2 a2
union all
select 3 table_id, a3.* from ACCOUNT_3 a3;

Then you can use an INSTEAD OF trigger on this view to insert each day's data into its own table: the first day into ACCOUNT_1, the second into ACCOUNT_2, etc., and truncate the oldest table each midnight. You can easily get the current table name using:

select 'ACCOUNT_' || (mod(to_number(to_char(sysdate, 'J')), 3) + 1) tab_name from dual;
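
A minimal sketch of such an INSTEAD OF trigger (trigger name and column list are placeholders; only the two columns from the question are shown, adjust to your real columns):

CREATE OR REPLACE TRIGGER account_view_ioi
INSTEAD OF INSERT ON ACCOUNT
FOR EACH ROW
DECLARE
  v_slot PLS_INTEGER := mod(to_number(to_char(sysdate, 'J')), 3) + 1;
BEGIN
  -- route today's rows into the table for the current slot
  IF v_slot = 1 THEN
    INSERT INTO ACCOUNT_1 (ACCOUNT_ID, BATCH_ID) VALUES (:new.ACCOUNT_ID, :new.BATCH_ID);
  ELSIF v_slot = 2 THEN
    INSERT INTO ACCOUNT_2 (ACCOUNT_ID, BATCH_ID) VALUES (:new.ACCOUNT_ID, :new.BATCH_ID);
  ELSE
    INSERT INTO ACCOUNT_3 (ACCOUNT_ID, BATCH_ID) VALUES (:new.ACCOUNT_ID, :new.BATCH_ID);
  END IF;
END;
/

Each midnight a scheduled job can then simply TRUNCATE the table for the slot about to be reused, which is practically instant compared to the massive DELETEs.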

Upvotes: 1
