Reputation: 575
Ok, so I'm working on a (rather old) project at work, which runs loads of queries against an Oracle database. I recently stumbled upon this gem, which takes about 6-7 hours to run and returns ~1400 rows. The table/view in question contains ~200'000 rows. That seemed to be taking rather longer than reasonable, so I started having a closer look at it. I can't, for security/proprietary reasons, share the exact query, but this should show what it does in more general terms:
SELECT
    some_field,
    some_other_field
FROM (
    SELECT
        *
    FROM
        some_view a
    WHERE
        some_criteria AND
        a.client_no || ':' || a.engagement_no || ':' || a.registered_date = (
            SELECT
                b.client_no || ':' || b.engagement_no || ':' || MAX(b.registered_date)
            FROM
                some_view b
            JOIN some_engagement_view e
                ON e.client_no = b.client_no AND e.engagement_no = b.engagement_no
            JOIN some_client_view c
                ON c.client_no = b.client_no
            WHERE
                some_other_criteria AND
                b.client_no = a.client_no AND
                b.engagement_no = a.engagement_no
            GROUP BY
                b.client_no,
                b.engagement_no
        )
);
Basically, as far as I've managed to figure out, it is supposed to fetch, from some_view (which contains evaluations of customers/engagements), the latest evaluation for every unique client/engagement.
The two joins are there to ensure that the client and engagement exist in another system, where they are primarily handled after the evaluation has been done in this system.
Notice how it concatenates two numbers and a date, and then compares that to a sub-query? "Interesting" design-choice. So I thought that if you replaced the concatenation with a proper comparison, you might get at least some kind of performance gain. Please note that I primarily do .NET development for the web, and am far from an expert when it comes to databases, but I rewrote it as follows:
SELECT
    some_field,
    some_other_field
FROM
    some_view a
WHERE
    some_criteria AND
    (a.client_no, a.engagement_no, a.registered_date) = (
        SELECT
            b.client_no,
            b.engagement_no,
            MAX(b.registered_date)
        FROM
            some_view b
        JOIN some_engagement_view e
            ON e.client_no = b.client_no AND e.engagement_no = b.engagement_no
        JOIN some_client_view c
            ON c.client_no = b.client_no
        WHERE
            some_other_criteria AND
            b.client_no = a.client_no AND
            b.engagement_no = a.engagement_no
        GROUP BY
            b.client_no,
            b.engagement_no
    );
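As an aside on why the tuple comparison seems preferable beyond speed: the original concatenation compares strings, so the date is run through an implicit date-to-string conversion that depends on the session's NLS settings, and without a separator distinct keys could even collide outright (the ':' in the original guards against that, but the comparison is still string-based, which also tends to rule out index use on the individual columns). A quick illustration against DUAL, with made-up values:

```sql
-- Without a separator, the distinct key pairs (12, 3) and (1, 23)
-- both concatenate to '123'; with ':' they stay distinct.
SELECT
    CASE WHEN 12 || 3 = 1 || 23
         THEN 'collision' ELSE 'distinct' END AS no_separator,
    CASE WHEN 12 || ':' || 3 = 1 || ':' || 23
         THEN 'collision' ELSE 'distinct' END AS with_separator
FROM dual;
-- no_separator: 'collision' ('123' = '123'); with_separator: 'distinct'.
```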
Now if I replace the fields in the very first select with a COUNT(1), I get exactly the same number of rows with both queries, so a good start. The new query fetches the data just as fast as it counts: < 10 seconds. The old query gets the count in ~20 seconds, and as I mentioned before, fetching the data takes close to 6-7 hours. It is currently running so that I can do some kind of analysis to see if the new query is valid, but I thought that I'd ask here as well: is there anything apparently wrong with what I have done?
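A stricter check than comparing row counts would be a set difference in both directions, which should come back empty if the two queries are equivalent. A sketch, where `<old query>` and `<new query>` are placeholders for the two full statements above:

```sql
-- Each MINUS should return zero rows if the rewrite is equivalent.
SELECT * FROM ( <old query> )
MINUS
SELECT * FROM ( <new query> );

SELECT * FROM ( <new query> )
MINUS
SELECT * FROM ( <old query> );
```

One caveat: MINUS removes duplicate rows, so if duplicates matter, the COUNT(1) comparison is still worth keeping alongside it.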
EDIT: I also removed the outermost query, which did not seem to serve any purpose, except maybe making the query look cooler... or something... I dunno...
Upvotes: 1
Views: 157
Reputation: 191285
Expanding on my comment... if I try to replicate your query structure using built-in views, it also runs for a long time. For example, getting the most recently created table for each owner (purely for demo purposes; it can be done more simply) like this takes several minutes, with either version:
SELECT
    owner,
    object_name
FROM
    all_objects a
WHERE
    (a.owner, a.object_type, TRUNC(a.created)) = (
        SELECT
            b.owner, b.object_type, TRUNC(MAX(b.created))
        FROM
            all_objects b
        JOIN all_tables e
            ON e.owner = b.owner AND e.table_name = b.object_name
        JOIN all_users c
            ON c.username = b.owner
        WHERE
            b.owner = a.owner AND
            b.object_type = a.object_type
        GROUP BY
            b.owner,
            b.object_type
    );
If I rewrite that to avoid the self-join on all_objects (equivalent to some_view in your example) by using an analytic function instead:
SELECT
    owner,
    object_name
FROM (
    SELECT
        a.owner,
        a.object_name,
        row_number() over (partition by a.owner, a.object_type
                           order by a.created desc) as rn
    FROM
        all_objects a
    JOIN all_tables e
        ON e.owner = a.owner AND e.table_name = a.object_name
    JOIN all_users c
        ON c.username = a.owner
)
WHERE
    rn = 1;
... then it takes a few seconds.
Now, in this case I don't get exactly the same output, because I have multiple objects created at the same time (within the same second, as far as created is concerned). I don't know the precision of the values stored in your registered_date, of course. So you might need to look at different functions, possibly rank rather than row_number, or adjust the ordering to deal with ties if necessary:
rank() over (partition by a.owner, a.object_type
order by trunc(a.created) desc) as rn
...
WHERE
rn = 1;
gives me the same results (well, almost; the join to all_tables is also skewing things, as I seem to have tables listed in all_objects that aren't in all_tables, but that's a side issue). Or max could work too:
max(created) over (partition by a.owner, a.object_type) as mx
...
WHERE
TRUNC(created) = TRUNC(mx)
In both of those I'm using trunc to get everything onto the same day; you may not need to if your registered_date doesn't have a time component.
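Translated back to your placeholder schema, the analytic version would look something like the sketch below. It is untested, since the real names and criteria aren't shown, and it assumes some_criteria and some_other_criteria can be applied together in one WHERE clause:

```sql
SELECT
    some_field,
    some_other_field
FROM (
    SELECT
        a.*,
        row_number() over (partition by a.client_no, a.engagement_no
                           order by a.registered_date desc) as rn
    FROM
        some_view a
    JOIN some_engagement_view e
        ON e.client_no = a.client_no AND e.engagement_no = a.engagement_no
    JOIN some_client_view c
        ON c.client_no = a.client_no
    WHERE
        some_criteria AND some_other_criteria
)
WHERE
    rn = 1;
```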
But of course, check you do actually get the same results.
Upvotes: 1