Reputation: 61
I have a Spring batch that will run everyday to :
Read CSV files and import them into our database
Aggregate this data and save these aggregated data into another table.
We have a table BATCH_LIST
that contains information about all the batchs that were already executed.
BATCH_LIST
has the following columns :
1. BATCH_ID
2. EXECUTION_DATE
3. STATUS
Among the CSV files that are imported, we have one CSV file to feed a APP_USERS
table, and another one to feed the ACCOUNTS
table.
APP_USERS
has the following columns :
1. USER_ID
2. BATCH_ID
-- more columns
ACCOUNTS
has the following columns :
1. ACCOUNT_ID
2. BATCH_ID
-- more columns
In step 2, we aggregate data from ACCOUNTS
and APP_USERS
to insert rows into a USER_ACCOUNT_RELATION
table. This table has exactly two columns : ACCOUNT_ID
(refering to ACCOUNTS.ACCOUNT_ID
) and USER_ID
(refering to APP_USERS.USER_ID
).
Now we want to add another step in our Spring batch. We want to delete all the data from USER_ACCOUNT_RELATION
table but also APP_USERS
and ACCOUNTS
that are no longer relevant (ie data that was imported before sysdate - 2
.
What has been done so far :
Get all the BATCH_ID
that we want to remove from the database
SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2
For each BATCH_ID
, we are calling the following methods :
public void deleteAppUsersByBatchId(Connection connection, long batchId) throws SQLException
// prepared statements to delete User account relation and user
And here are the two prepared statements :
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN (
SELECT USER_ID FROM APP_USERS WHERE BATCH_ID = ?
);
DELETE FROM APP_USERS WHERE BATCH_ID = ?
My issue is that it takes too long to delete data for one BATCH_ID
(more than 1 hour).
Note : I only mentioned the APP_USERS
, ACCOUNTS
AND USER_ACCOUNT_RELATION
tables, but I actually have around 25 tables to delete.
How can I improve the query time ?
(I've just tried to change the WHERE USER_ID IN ()
into an EXISTS
. It is better but still way too long.
Upvotes: 0
Views: 181
Reputation: 8655
If that will be your regular process, ie you want to store only last 2 days, you don't need indexes, since every time you will delete 1/3 of all rows.
It's better to use just 3 deletes instead of 3*7 separate deletes:
DELETE FROM USER_ACCOUNT_RELATION
WHERE ACCOUNT_ID IN
(
SELECT u.ID
FROM {USER} u
join {FILE} f
on u.FILE_ID = f.file
WHERE trunc(f.IMPORT_DATE) < (sysdate - 2)
);
DELETE FROM {USER}
WHERE FILE_ID in (select FILE_ID from {file} where trunc(IMPORT_DATE) < (sysdate - 2));
DELETE FROM {ACCOUNT}
WHERE FILE_ID in (select FILE_ID from {file} where trunc(IMPORT_DATE) < (sysdate - 2));
Just replace {USER}
, {FILE}
, {ACCOUNT}
with your real table names.
Obviously in case of partitioning option it would be much easier - daily interval partitioning, so you could easily drop old partitions.
But even in your case, there is also another more difficult but really fast solution - "partition views": for example for ACCOUNT, you can create 3 different tables ACCOUNT_1, ACCOUNT_2 and ACCOUNT_3, then create partition view:
create view ACCOUNT as
select 1 table_id, a1.* from ACCOUNT_1 a1
union all
select 2 table_id, a2.* from ACCOUNT_2 a2
union all
select 3 table_id, a3.* from ACCOUNT_3 a3;
Then you can use instead of trigger on this view to insert daily data into own table: first day into account_1,second - account_2, etc. And truncate old table each midnight. You can easily get table name using
select 'ACCOUNT_'|| (mod(to_char(sysdate, 'j'),3)+1) tab_name from dual;
Upvotes: 1