Ralph

Reputation: 468

Retrieving a large dataset from a very large table

I have a very large table in Oracle that contains 140+ million rows. Currently we do three full table scans on this table nightly and use some of the results to populate a temporary table. That temporary table is then turned into a very large report (usually 140K+ lines).

The big table is called tasklog and has the following structure: tasklog_id (number) - PK, document_id (number), date_time_in (date), plus a few more columns that aren't relevant.

There are millions of different document_ids, each repeated between one and several hundred times; date_time_in is the time the entry was put into the database.

All of the full table scans look like this:

DECLARE
n_prevdocid     number;

cursor tasks is
   select * 
   from tasklog
   order by document_id, date_time_in DESC;

BEGIN

for tk in tasks
loop
    if n_prevdocid is null or n_prevdocid <> tk.document_id then  -- first (newest) row of a new document_id
         -- *code snipped*

    end if;
    n_prevdocid := tk.document_id;
end loop;

END;
/
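The loop above is a control-break scan: because the cursor is ordered by (document_id, date_time_in DESC), the first row seen for each document_id is its most recent entry. A minimal Python sketch of that logic, using made-up sample data in place of the cursor:

```python
# Rows as the cursor would deliver them: ordered by document_id,
# then date_time_in descending. Data is illustrative only.
rows = [
    (1, "2012-01-03"),
    (1, "2012-01-01"),
    (2, "2012-01-02"),
]

latest = []
prev_docid = None
for document_id, date_time_in in rows:
    if document_id != prev_docid:      # first (newest) row of a new group
        latest.append((document_id, date_time_in))
    prev_docid = document_id

print(latest)  # → [(1, '2012-01-03'), (2, '2012-01-02')]
```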

So my question: is there a quick(ish) way to get a distinct list of document_ids, each with the row having the most recent date_time_in? That could dramatically speed up the whole thing. Or can anyone think of a better way of retrieving this data daily?

Things that may be relevant: rows are only ever inserted into this table with the current date and time, and no rows are ever updated or deleted. About 70k - 80k rows are inserted daily. The table is not range-partitioned, but I can't see how that would help me anyway.

Upvotes: 1

Views: 3290

Answers (3)

David Aldridge

Reputation: 52396

I don't think you're going to get away from doing at least one full table scan, as the only way it would be efficient is if the ratio of distinct document_ids to total records were pretty small. The clustering on document_id is going to be very poor due to the way the data is generated and inserted.

How about:

create table tmp nologging compress -- or pctfree 0
as
select ...
from   (
  select t.*,
         max(date_time_in) over (partition by document_id) max_date_time_in
  from   tasklog t)
where   date_time_in = max_date_time_in
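The window-function filter above can be demonstrated end to end with sqlite3 (SQLite 3.25+ supports analytic functions with the same syntax). The table and data here are a made-up miniature of tasklog, just to show the shape of the result:

```python
import sqlite3

# Miniature stand-in for tasklog; column names follow the question,
# the data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasklog ("
    "tasklog_id INTEGER PRIMARY KEY, document_id INTEGER, date_time_in TEXT)"
)
conn.executemany(
    "INSERT INTO tasklog (document_id, date_time_in) VALUES (?, ?)",
    [(1, "2012-01-01"), (1, "2012-01-03"), (2, "2012-01-02")],
)

# Keep only the row with the latest date_time_in per document_id,
# the same filter as the CREATE TABLE ... AS SELECT in the answer.
rows = conn.execute("""
    SELECT document_id, date_time_in
    FROM (
        SELECT t.*,
               MAX(date_time_in) OVER (PARTITION BY document_id) AS max_date_time_in
        FROM tasklog t
    )
    WHERE date_time_in = max_date_time_in
    ORDER BY document_id
""").fetchall()
print(rows)  # → [(1, '2012-01-03'), (2, '2012-01-02')]
```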

Possibly, having created this once, you could then optimise further refreshes by merging into this set only the newer records. Something like ...

merge into tmp
using (
  select ...
  from   (
    select t.*,
           max(date_time_in) over (partition by document_id) max_date_time_in
    from   tasklog t
    where  date_time_in > (select max(date_time_in) from tmp))
  where   date_time_in = max_date_time_in)
on ... blah blah
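The incremental-refresh idea behind the MERGE can be sketched in plain Python: keep a snapshot keyed by document_id and fold in only rows newer than the snapshot's high-water mark. Names and data here are invented for illustration:

```python
# snapshot: document_id -> latest date_time_in seen so far (made-up data)
snapshot = {1: "2012-01-01", 2: "2012-01-02"}
high_water = max(snapshot.values())                # newest timestamp in snapshot

# Rows inserted since the last refresh (only these need to be scanned).
new_rows = [(1, "2012-01-03"), (3, "2012-01-04")]

for document_id, date_time_in in new_rows:
    # Matches MERGE semantics: update when matched and newer,
    # insert when not matched.
    if date_time_in > high_water and date_time_in > snapshot.get(document_id, ""):
        snapshot[document_id] = date_time_in

print(sorted(snapshot.items()))
# → [(1, '2012-01-03'), (2, '2012-01-02'), (3, '2012-01-04')]
```

This mirrors why the MERGE approach pays off here: since rows are only ever inserted with the current date/time, everything older than the snapshot's high-water mark can be skipped.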

Upvotes: 1

sourcecode

Reputation: 1802

You can do something like this:

select document_id, max(date_time_in) as date_time_in
from tasklog
group by document_id;

This retrieves each distinct document_id with its latest date_time_in.

Upvotes: 0

Rene

Reputation: 10551

Have you tried:

select document_id
from tasklog t1
where date_time_in = (select max(date_time_in)
                      from tasklog t2
                      where t1.document_id=t2.document_id)
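The correlated subquery can likewise be tried out with sqlite3 on a made-up miniature of tasklog (column names follow the question):

```python
import sqlite3

# Invented sample data, just to show what the correlated subquery returns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasklog ("
    "tasklog_id INTEGER PRIMARY KEY, document_id INTEGER, date_time_in TEXT)"
)
conn.executemany(
    "INSERT INTO tasklog (document_id, date_time_in) VALUES (?, ?)",
    [(1, "2012-01-01"), (1, "2012-01-03"), (2, "2012-01-02")],
)

# For each row, keep it only if its date_time_in equals the maximum
# for that document_id -- the correlated subquery from the answer.
rows = conn.execute("""
    SELECT document_id, date_time_in
    FROM tasklog t1
    WHERE date_time_in = (SELECT MAX(date_time_in)
                          FROM tasklog t2
                          WHERE t1.document_id = t2.document_id)
    ORDER BY document_id
""").fetchall()
print(rows)  # → [(1, '2012-01-03'), (2, '2012-01-02')]
```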

Upvotes: 0
