Reputation: 3160
Our system is facing performance issues selecting rows from a table with 38 million rows.
The table stores information about clients, suppliers, etc., and these appear across many other tables, such as Invoices.
The main problem is that our database is far from normalized. The Clients_Suppliers table has a composite key made of three columns: Code, a varchar2(16); Category, a char(2); and up_date, a date. Every change to a client's address is stored in that same table with a new date, so we can have records such as these:
code ca up_date
---------------- -- --------
1234567890123456 CL 01/01/09
1234567890123456 CL 01/01/10
1234567890123456 CL 01/01/11
1234567890123456 CL 01/01/12
6543210987654321 SU 01/01/10
6543210987654321 SU 08/03/11
Worse, in every table that uses a client's information, only the code and category are stored instead of the full composite key. Invoices, for instance, has its own keys, including the emission date. So we can have something like this:
invoice_no serial_no emission code ca
---------- --------- -------- ---------------- --
1234567890 12345 05/02/12 1234567890123456 CL
My specific problem is that I have to generate a list of clients for which invoices were created in a given period. Since I have to get the most recent info for the clients, I have to use max(up_date).
So here's my query (in Oracle):
SELECT
CL.CODE,
CL.CATEGORY,
-- other address fields
FROM
CLIENTS_SUPPLIERS CL,
INVOICES I
WHERE
CL.CODE = I.CODE AND
CL.CATEGORY = I.CATEGORY AND
CL.UP_DATE =
(SELECT
MAX(CL2.UP_DATE)
FROM
CLIENTS_SUPPLIERS CL2
WHERE
CL2.CODE = I.CODE AND
CL2.CATEGORY = I.CATEGORY AND
CL2.UP_DATE <= I.EMISSION
) AND
I.EMISSION BETWEEN DATE1 AND DATE2
It takes up to seven hours to select 178,000 rows. Invoices has 300,000 rows between DATE1 and DATE2.
It's a (very, very, very) bad design, and I've raised the fact that we should improve it by normalizing the tables. That would involve creating a table for clients with a new integer primary key for each code/category pair, and another one for Addresses (with the client primary key as a foreign key), then using the Addresses primary key in each table that relates to clients.
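Roughly, the kind of structure I have in mind (table and column names here are only illustrative):

CREATE TABLE clients (
    client_id  NUMBER       PRIMARY KEY, -- populated from a sequence
    code       VARCHAR2(16) NOT NULL,
    category   CHAR(2)      NOT NULL,
    CONSTRAINT clients_uq UNIQUE (code, category)
);

CREATE TABLE addresses (
    address_id NUMBER PRIMARY KEY,       -- populated from a sequence
    client_id  NUMBER NOT NULL REFERENCES clients (client_id),
    up_date    DATE   NOT NULL
    -- other address fields go here
);

-- Invoices and the other tables would then store addresses.address_id
-- instead of repeating code/category.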
But it would mean changing the whole system, so my suggestion has been shunned. I need to find a different way of improving performance (apparently using only SQL).
I've tried indexes, views, and temporary tables, but none brought any significant improvement in performance. I'm out of ideas; does anyone have a solution for this?
Thanks in advance!
Upvotes: 1
Views: 1024
Reputation: 1270081
You might try rewriting the query to use analytic functions rather than a correlated subquery:
select *
from (SELECT CL.CODE, CL.CATEGORY, CL.UP_DATE, -- other address fields
             max(up_date) over (partition by cl.code, cl.category) as max_up_date
      FROM CLIENTS_SUPPLIERS CL join
           INVOICES I
           on CL.CODE = I.CODE AND
              CL.CATEGORY = I.CATEGORY and
              I.EMISSION BETWEEN DATE1 AND DATE2 and
              up_date <= i.emission
     ) t
where t.up_date = t.max_up_date
You might want to remove the max_up_date column from the outer select.
As some have noticed, this query is subtly different from the original, because it takes the max of up_date over all the joined rows rather than per invoice. The original query has the condition:
CL2.UP_DATE <= I.EMISSION
However, by transitivity, this means that:
CL2.UP_DATE <= DATE2
So the only difference is when the max of the update date is less than DATE1 in the original query. However, these rows would be filtered out by the comparison to UP_DATE.
Although this query is phrased slightly differently, I think it does the same thing. I must admit to not being 100% positive, since this is a subtle situation with data that I'm not familiar with.
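If the per-invoice behaviour matters, one variant that should match the original exactly is to also partition by the invoice key (assuming that INVOICE_NO plus SERIAL_NO identify an invoice row, as the sample data suggests):

select t.code, t.category -- other address fields
from (SELECT CL.CODE, CL.CATEGORY, CL.UP_DATE,
             max(cl.up_date) over (partition by cl.code, cl.category,
                                   i.invoice_no, i.serial_no) as max_up_date
      FROM CLIENTS_SUPPLIERS CL join
           INVOICES I
           on CL.CODE = I.CODE AND
              CL.CATEGORY = I.CATEGORY and
              I.EMISSION BETWEEN DATE1 AND DATE2 and
              cl.up_date <= i.emission
     ) t
where t.up_date = t.max_up_date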
Upvotes: 0
Reputation: 15473
The correlated subquery may be causing issues, but to me the real problem is what seems to be your main client table: you cannot easily grab the most recent data without the max(up_date) mess. It's really a mix of history and current data and, as you describe, poorly designed.
Anyway, it will help you in this and other long-running joins to have a table/view with ONLY the most recent data for a client. So, first build a materialized view for this (untested):
create or replace materialized view recent_clients_view
tablespace my_tablespace
nologging
build deferred
refresh complete on demand
as
select * from
(
  select c.*,
         row_number() over (partition by code, category
                            order by up_date desc, rowid desc) rnum
  from clients_suppliers c
)
where rnum = 1;
Add a unique index on (code, category). The assumption is that this will be refreshed periodically on some off-hours schedule, and that your queries using this will be OK with showing data AS OF the date of the last refresh. In a DW environment or for reporting, this is usually the norm.
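For example (the index name is just a suggestion):

CREATE UNIQUE INDEX recent_clients_view_ux
  ON recent_clients_view (code, category);

BEGIN
  -- 'C' asks for a complete refresh of the snapshot
  DBMS_MVIEW.REFRESH('RECENT_CLIENTS_VIEW', 'C');
END;
/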
The snapshot table for this view should be MUCH smaller than the full clients table with all the history.
Now you are joining invoices to this smaller view, doing an equijoin on code, category (where emission is between date1 and date2). Something like:
select cv.*
from
recent_clients_view cv,
invoices i
where cv.code = i.code
and cv.category = i.category
and i.emission between :date1 and :date2;
Hope that helps.
Upvotes: 0
Reputation: 67742
Assuming that the number of rows for a (code,ca) is smallish, I would try to force an index scan per invoice with an inline view, such as:
SELECT i.invoice_no,
       (SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
          FROM clients_suppliers c
         WHERE c.code = i.code
           AND c.category = i.category
           AND c.up_date <= i.emission) AS client_rowid
  FROM invoices i
 WHERE i.emission BETWEEN :p1 AND :p2
You would then join this query back to CLIENTS_SUPPLIERS, hopefully triggering a join via rowid (300k rowid reads are negligible).
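For example, something along these lines (reusing the scalar subquery above; column names follow the question's schema):

SELECT x.invoice_no, c.*
  FROM (SELECT i.invoice_no,
               (SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
                  FROM clients_suppliers c
                 WHERE c.code = i.code
                   AND c.category = i.category
                   AND c.up_date <= i.emission) AS client_rowid
          FROM invoices i
         WHERE i.emission BETWEEN :p1 AND :p2) x
  JOIN clients_suppliers c
    ON c.rowid = x.client_rowid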
You could improve the above query by using SQL objects:
CREATE TYPE client_obj AS OBJECT (
  name VARCHAR2(50),
  add1 VARCHAR2(50)
  /*address2, city...*/
);
SELECT i.o.name, i.o.add1 /*...*/
  FROM (SELECT DISTINCT
               (SELECT client_obj(
                          max(name) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC),
                          max(add1) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
                          /*city...*/
                       )
                  FROM clients_suppliers c
                 WHERE c.code = i.code
                   AND c.category = i.category
                   AND c.up_date <= i.emission) o
          FROM invoices i
         WHERE i.emission BETWEEN :p1 AND :p2) i
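Note that, as far as I recall, SELECT DISTINCT on an object type raises ORA-22950 unless the type has a MAP or ORDER method, so the type declaration would need something along these lines (the sort_key implementation is only a sketch):

CREATE OR REPLACE TYPE client_obj AS OBJECT (
  name VARCHAR2(50),
  add1 VARCHAR2(50),
  /*address2, city...*/
  MAP MEMBER FUNCTION sort_key RETURN VARCHAR2
);
/
CREATE OR REPLACE TYPE BODY client_obj AS
  MAP MEMBER FUNCTION sort_key RETURN VARCHAR2 IS
  BEGIN
    -- concatenate the attributes so that two objects can be compared
    RETURN name || '|' || add1;
  END;
END;
/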
Upvotes: 0
Reputation: 21532
SELECT
    CL2.CODE,
    CL2.CATEGORY,
    MAX(CL2.UP_DATE) AS UP_DATE -- join back to CLIENTS_SUPPLIERS on this date for the other fields
FROM
    CLIENTS_SUPPLIERS CL2 INNER JOIN (
        SELECT DISTINCT
            CL.CODE,
            CL.CATEGORY,
            I.EMISSION
        FROM
            CLIENTS_SUPPLIERS CL INNER JOIN INVOICES I ON CL.CODE = I.CODE AND CL.CATEGORY = I.CATEGORY
        WHERE
            I.EMISSION BETWEEN DATE1 AND DATE2) CL3 ON CL2.CODE = CL3.CODE AND CL2.CATEGORY = CL3.CATEGORY
WHERE
    CL2.UP_DATE <= CL3.EMISSION
GROUP BY
    CL2.CODE,
    CL2.CATEGORY
The idea is to separate the process: first we tell Oracle to give us the list of clients for which there are invoices in the period you want, and then we get the last version of each of them. In your version there's a check against MAX 38,000,000 times, which I really think is what cost most of the time spent in the query.
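Since the grouped query above only keeps the latest UP_DATE per client, you would join back to CLIENTS_SUPPLIERS to pick up the other address fields; roughly (using a WITH clause only to avoid repeating the query):

WITH last_versions AS (
    SELECT
        CL2.CODE,
        CL2.CATEGORY,
        MAX(CL2.UP_DATE) AS UP_DATE
    FROM
        CLIENTS_SUPPLIERS CL2 INNER JOIN (
            SELECT DISTINCT
                CL.CODE,
                CL.CATEGORY,
                I.EMISSION
            FROM
                CLIENTS_SUPPLIERS CL INNER JOIN INVOICES I
                    ON CL.CODE = I.CODE AND CL.CATEGORY = I.CATEGORY
            WHERE
                I.EMISSION BETWEEN DATE1 AND DATE2) CL3
            ON CL2.CODE = CL3.CODE AND CL2.CATEGORY = CL3.CATEGORY
    WHERE
        CL2.UP_DATE <= CL3.EMISSION
    GROUP BY
        CL2.CODE,
        CL2.CATEGORY
)
SELECT CS.* -- code, category and the other address fields
FROM CLIENTS_SUPPLIERS CS INNER JOIN last_versions LV
    ON CS.CODE = LV.CODE
   AND CS.CATEGORY = LV.CATEGORY
   AND CS.UP_DATE = LV.UP_DATE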
However, I'm not getting into indexes here; I'm assuming they are correctly set up...
Upvotes: 1
Reputation: 1977
What does the DBA have to say?
What has he/she tried?
I'm not saying the SQL is perfect, but if performance is degrading over time, the DBA really needs to take a look at it.
Upvotes: 1