Tarek

Reputation: 3160

Select query with join in huge table taking over 7 hours

Our system is facing performance issues when selecting rows from a 38-million-row table.

This 38-million-row table stores information about clients, suppliers, etc., which appear across many other tables, such as Invoices.

The main problem is that our database is far from normalized. The Clients_Suppliers table has a composite key made of three columns: Code (varchar2(16)), Category (char(2)) and up_date (a date). Every change to a client's address is stored in that same table under a new date, so we can have records such as these:

code             ca   up_date
---------------- --   --------
1234567890123456 CL   01/01/09
1234567890123456 CL   01/01/10
1234567890123456 CL   01/01/11
1234567890123456 CL   01/01/12
6543210987654321 SU   01/01/10
6543210987654321 SU   08/03/11

Worse, every table that uses a client's information stores only the code and category instead of the full composite key. Invoices, for instance, has its own keys, including the emission date. So we can have something like this:

invoice_no serial_no emission code             ca
---------- --------- -------- ---------------- --
1234567890 12345     05/02/12 1234567890123456 CL

My specific problem is that I have to generate a list of clients for which invoices were created in a given period. Since I have to get the most recent info for each client, I have to use max(up_date).

So here's my query (in Oracle):

SELECT
  CL.CODE,
  CL.CATEGORY
  -- , other address fields
FROM
  CLIENTS_SUPPLIERS CL,
  INVOICES I
WHERE
  CL.CODE = I.CODE AND
  CL.CATEGORY = I.CATEGORY AND
  CL.UP_DATE = 
    (SELECT
       MAX(CL2.UP_DATE)
     FROM
       CLIENTS_SUPPLIERS CL2
     WHERE
       CL2.CODE = I.CODE AND
       CL2.CATEGORY = I.CATEGORY AND
       CL2.UP_DATE <= I.EMISSION
    ) AND
  I.EMISSION BETWEEN DATE1 AND DATE2

It takes up to seven hours to select 178,000 rows. INVOICES has 300,000 rows with EMISSION between DATE1 and DATE2.

It's a (very, very, very) bad design, and I've raised the point that we should improve it by normalizing the tables. That would involve creating a table for clients with a new integer primary key for each code/category pair, and another one for Addresses (with the client primary key as a foreign key), then using the Addresses primary key in each table that relates to clients.
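
A minimal sketch of that proposed schema (all names and column sizes are hypothetical, for illustration only):

-- Hypothetical normalized schema
CREATE TABLE clients (
  client_id NUMBER       PRIMARY KEY,   -- new surrogate key
  code      VARCHAR2(16) NOT NULL,
  category  CHAR(2)      NOT NULL,
  CONSTRAINT uq_clients_code_cat UNIQUE (code, category)
);

CREATE TABLE addresses (
  address_id NUMBER PRIMARY KEY,
  client_id  NUMBER NOT NULL REFERENCES clients (client_id),
  up_date    DATE   NOT NULL
  -- street, city and the other address fields would go here
);

-- Invoices and the other tables would then store addresses.address_id
-- instead of (code, category).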

But it would mean changing the whole system, so my suggestion has been shunned. I need to find a different way of improving performance (apparently using only SQL).

I've tried indexes, views and temporary tables, but none has brought any significant improvement in performance. For reference, the composite index I mean is sketched below. I'm out of ideas; does anyone have a solution for this?
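
-- Hypothetical index name; covers the subquery's lookup by (code, category)
-- and the MAX(up_date) range condition
CREATE INDEX ix_cs_code_cat_update ON clients_suppliers (code, category, up_date);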

Thanks in advance!

Upvotes: 1

Views: 1024

Answers (5)

Gordon Linoff

Reputation: 1270081

You might try rewriting the query to use analytic functions rather than a correlated subquery:

select *
from (SELECT CL.CODE, CL.CATEGORY, CL.UP_DATE,   -- other address fields
             max(cl.up_date) over (partition by cl.code, cl.category) as max_up_date
      FROM CLIENTS_SUPPLIERS CL join
           INVOICES I
           on CL.CODE = I.CODE AND
              CL.CATEGORY = I.CATEGORY and
              I.EMISSION BETWEEN DATE1 AND DATE2 and
              cl.up_date <= i.emission
     ) t
where t.up_date = t.max_up_date

You might want to remove the max_up_date column from the outer select.
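
That is, listing the wanted columns explicitly in the outer query instead of select * (a sketch of the same statement):

select t.code, t.category   -- other address fields
from (select cl.code, cl.category, cl.up_date,
             max(cl.up_date) over (partition by cl.code, cl.category) as max_up_date
      from clients_suppliers cl
      join invoices i
        on cl.code = i.code
       and cl.category = i.category
       and i.emission between date1 and date2
       and cl.up_date <= i.emission
     ) t
where t.up_date = t.max_up_date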

As some have noticed, this query is subtly different from the original, because it is taking the max of up_date over all dates. The original query has the condition:

CL2.UP_DATE <= I.EMISSION

However, by transitivity, this means that:

CL2.UP_DATE <= DATE2

So the only difference is when the max of the update date is less than DATE1 in the original query. However, these rows would be filtered out by the comparison to UP_DATE.

Although this query is phrased slightly differently, I think it does the same thing. I must admit to not being 100% positive, since this is a subtle situation on data that I'm not familiar with.

Upvotes: 0

tbone

Reputation: 15473

The correlated subquery may be causing issues, but to me the real problem is in what seems to be your main client table: you cannot easily grab the most recent data without doing the max(up_date) mess. It's really a mix of history and current data and, as you describe, poorly designed.

Anyway, it will help you in this and other long-running joins to have a table/view with ONLY the most recent data for each client. So, first build a materialized view for this (untested):

create or replace materialized view recent_clients_view
tablespace my_tablespace
nologging
build deferred
refresh complete on demand
as
select * from 
(
  select c.*, row_number() over (partition by code, category order by up_date desc, rowid desc) rnum
  from clients_suppliers c
)
where rnum = 1;

Add a unique index on (code, category). The assumption is that this will be refreshed periodically on some off-hours schedule, and that your queries using it will be OK with showing data AS OF the date of the last refresh. In a DW environment or for reporting, this is usually the norm.
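
For instance (index name hypothetical; the refresh uses the standard DBMS_MVIEW package):

-- Unique index backing the equijoin on (code, category)
CREATE UNIQUE INDEX ux_recent_clients ON recent_clients_view (code, category);

-- Scheduled off-hours refresh; 'C' requests a complete refresh
BEGIN
  DBMS_MVIEW.REFRESH('RECENT_CLIENTS_VIEW', 'C');
END;
/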

The snapshot table for this view should be MUCH smaller than the full clients table with all the history.

Now you are simply joining INVOICES to this much smaller view with an equijoin on (code, category), filtering on emission between date1 and date2. Something like:

select cv.*
from 
recent_clients_view cv,
invoices i
where cv.code = i.code
and cv.category = i.category
and i.emission between :date1 and :date2;

Hope that helps.

Upvotes: 0

Vincent Malgrat

Reputation: 67742

Assuming that the number of rows for a (code, ca) pair is smallish, I would try to force an index scan per invoice with an inline view, such as:

SELECT i.invoice_no,
       (SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
          FROM clients_suppliers c
         WHERE c.code = i.code
           AND c.category = i.category
           AND c.up_date <= i.emission) client_rowid
  FROM invoices i
 WHERE i.emission BETWEEN :p1 AND :p2

You would then join this query to CLIENTS_SUPPLIERS, hopefully triggering a join via rowid (300k rowid reads are negligible).
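
A sketch of that join-back by rowid (the DISTINCT collapses duplicates when several invoices point at the same client version):

SELECT DISTINCT c.code, c.category   -- other address fields
  FROM (SELECT (SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
                  FROM clients_suppliers c2
                 WHERE c2.code = i.code
                   AND c2.category = i.category
                   AND c2.up_date <= i.emission) client_rowid
          FROM invoices i
         WHERE i.emission BETWEEN :p1 AND :p2) x
  JOIN clients_suppliers c
    ON c.rowid = x.client_rowid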

You could improve the above query by using SQL objects:

CREATE TYPE client_obj AS OBJECT (
   name     VARCHAR2(50),
   add1     VARCHAR2(50)
   /*, address2, city... */
);

SELECT i.o.name, i.o.add1 /*...*/
  FROM (SELECT DISTINCT
               (SELECT client_obj(
                         max(name) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC),
                         max(add1) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
                         /*city...*/
                       )
                  FROM clients_suppliers c
                 WHERE c.code = i.code
                   AND c.category = i.category
                   AND c.up_date <= i.emission) o
          FROM invoices i
         WHERE i.emission BETWEEN :p1 AND :p2) i

Upvotes: 0

Sebas

Reputation: 21532

SELECT
  CL2.CODE,
  CL2.CATEGORY,
  ... other fields
FROM
  CLIENTS_SUPPLIERS CL2 INNER JOIN (
    SELECT
      CL.CODE,
      CL.CATEGORY,
      MAX(CL.UP_DATE) AS LAST_UP_DATE
    FROM
      CLIENTS_SUPPLIERS CL INNER JOIN INVOICES I ON CL.CODE = I.CODE AND CL.CATEGORY = I.CATEGORY
    WHERE
      CL.UP_DATE <= I.EMISSION AND
      I.EMISSION BETWEEN DATE1 AND DATE2
    GROUP BY
      CL.CODE,
      CL.CATEGORY) CL3
  ON CL2.CODE = CL3.CODE AND CL2.CATEGORY = CL3.CATEGORY AND CL2.UP_DATE = CL3.LAST_UP_DATE

The idea is to separate the process: first we tell Oracle to give us the list of clients that have invoices in the period you want, and then we get the last version of each. In your version there's a check against MAX 38,000,000 times, which I really think is what cost most of the time spent in the query.

However, I'm not addressing indexes here; I'm assuming they are correctly set up...

Upvotes: 1

Robbie Dee

Reputation: 1977

What does the DBA have to say?

Has he/she tried:

  • Coalescing the tablespaces
  • Increasing the parallel query slaves
  • Moving indexes to a separate tablespace on a separate physical disk
  • Gathering stats on the relevant tables/indexes
  • Running an explain plan
  • Running the query through the index optimiser

I'm not saying the SQL is perfect, but if performance is degrading over time, the DBA really needs to take a look at it. A sketch of the stats and explain-plan items follows.
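
For instance (table names from the question; a sketch only):

-- Refresh optimizer statistics on the big table and its indexes
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'CLIENTS_SUPPLIERS',
    cascade => TRUE);
END;
/

-- Then inspect the execution plan of the slow query
EXPLAIN PLAN FOR
SELECT cl.code, cl.category
  FROM clients_suppliers cl, invoices i
 WHERE cl.code = i.code
   AND cl.category = i.category
   AND cl.up_date = (SELECT MAX(cl2.up_date)
                       FROM clients_suppliers cl2
                      WHERE cl2.code = i.code
                        AND cl2.category = i.category
                        AND cl2.up_date <= i.emission)
   AND i.emission BETWEEN :date1 AND :date2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);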

Upvotes: 1
