Niraj Choubey

Reputation: 4040

Update n random rows in SQL

I have a table with about 1000 rows. I have to update a column ("X") in the table to 'Y' for n random rows. For this I can use the following query:

update xyz set X='Y' where m in (
    SELECT m FROM (
        SELECT m
        FROM xyz
        order by dbms_random.value
    ) RNDM
    where rownum < n+1
);

Is there a more efficient way to write this query? The table has no index. Please help.

Upvotes: 6

Views: 11840

Answers (3)

0xdb

Reputation: 3697

The following solution works just fine. It performs well and behaves similarly to sample():

create table t1 as 
    select level id, cast ('item'||level as varchar2(32)) item 
    from dual connect by level<=100000; 

Table T1 created.

update t1 set item='*'||item 
where exists (
    select rnd from (
        select dbms_random.value() rnd
        from t1
    ) t2 where t2.rowid = t1.rowid and rnd < 0.15
);

14,858 rows updated.

Elapsed: 00:00:00.717 

Note that the alias rnd must be included in the select clause. Otherwise the optimizer changes the filter predicate from RND < 0.15 to DBMS_RANDOM.VALUE() < 0.15, in which case dbms_random.value is evaluated only once for the whole statement.
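For illustration, this is a sketch of the problematic form (same t1 table as above; whether the rewrite happens depends on the optimizer's view merging):

```sql
-- Do NOT do this: rnd is not projected by the outer subquery,
-- so the optimizer may merge the views and rewrite the filter to
-- DBMS_RANDOM.VALUE() < 0.15, evaluated once per statement --
-- updating either (almost) all rows or none.
update t1 set item='*'||item
where exists (
    select 1 from (
        select dbms_random.value() rnd
        from t1
    ) t2 where t2.rowid = t1.rowid and rnd < 0.15
);
```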

As mentioned in @JonHeller's answer, the best solution remains the PL/SQL code block, because it avoids a full table scan. Here is my suggestion:

create or replace type rowidListType is table of varchar(18);  
/
create or replace procedure updateRandomly (prefix varchar2 := '*') is
    rowidList rowidListType;  
begin  
    select rowidtochar (rowid) bulk collect into rowidList
    from t1 sample(15)
    ;
    update t1 set item=prefix||item 
    where exists (
        select 1 from table (rowidList) t2
        where chartorowid(t2.column_value) = t1.rowid
    );
    dbms_output.put_line ('updated '||sql%rowcount||' rows.'); 
end;
/
begin  updateRandomly; end;
/ 

Elapsed: 00:00:00.293
updated 14892 rows.

Upvotes: 2

Jon Heller

Reputation: 36807

You can improve performance by replacing the full table scan with a sample.

The first problem you run into is that you can't use SAMPLE in a DML subquery (ORA-30560: SAMPLE clause not allowed). But logically this is what is needed:

UPDATE xyz SET x='Y' WHERE rowid IN (
    SELECT r FROM (
        SELECT ROWID r FROM xyz sample(0.15) ORDER BY dbms_random.value
    ) RNDM WHERE rownum < 100/*n*/+1
);

You can get around this by using a collection to store the rowids, and then update the rows using the rowid collection. Normally breaking a query into separate parts and gluing them together with PL/SQL leads to horrible performance. But in this case you can still save a lot of time by significantly reducing the amount of data read.

declare
    type rowid_nt is table of rowid;
    rowids rowid_nt;
begin
    --Get the rowids
    SELECT r bulk collect into rowids
    FROM (
        SELECT ROWID r
        FROM xyz sample(0.15)
        ORDER BY dbms_random.value
    ) RNDM WHERE rownum < 100/*n*/+1;

    --update the table
    forall i in 1 .. rowids.count
        update xyz set x = 'Y'
        where rowid = rowids(i);
end;
/

I ran a simple test with 100,000 rows (on a table with only two columns), and N = 100. The original version took 0.85 seconds, @Gerrat's answer took 0.7 seconds, and the PL/SQL version took 0.015 seconds.

But that's only one scenario; I don't have enough information to say my answer will always be better. As N increases, the sampling advantage is lost and the writing becomes more significant than the reading. If you have a very small amount of data, the PL/SQL context-switching overhead in my answer may make it slower than @Gerrat's solution.

For performance issues, the size of the table in bytes is usually much more important than the size in rows. 1000 rows that use a terabyte of space is much larger than 100 million rows that only use a gigabyte.

Here are some problems to consider with my answer:

  1. Sampling does not always return exactly the percent you asked for. With 100,000 rows and a 0.15% sample size, the number of rows returned was 147, not 150. That's why I used 0.15 instead of 0.10: you need to over-sample a little to ensure you get more than N. How much do you need to over-sample? I have no idea; you'll probably have to test it and pick a safe number.
  2. You need to know the approximate number of rows to pick the percent.
  3. The percent must be a literal, so as the number of rows and N change, you'll need to use dynamic SQL to change the percent.
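To address point 3, here is a hedged sketch of the dynamic-SQL approach (table and column names taken from the question; the 1.5 over-sampling factor is an assumption you should tune per point 1):

```sql
declare
    type rowid_nt is table of rowid;
    rowids   rowid_nt;
    n        constant pls_integer := 100;  -- number of rows to update
    num_rows number;
    pct      number;
begin
    -- Estimate the row count, then compute an over-sampled percent.
    select count(*) into num_rows from xyz;
    pct := least(99, 100 * n / num_rows * 1.5);  -- SAMPLE percent must be < 100

    -- SAMPLE requires a literal percent, so build the query dynamically.
    execute immediate
        'SELECT r FROM (
             SELECT ROWID r FROM xyz SAMPLE(' || to_char(pct) || ')
             ORDER BY dbms_random.value
         ) WHERE rownum < :n + 1'
        bulk collect into rowids using n;

    forall i in 1 .. rowids.count
        update xyz set x = 'Y'
        where rowid = rowids(i);
end;
/
```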

Upvotes: 6

Gerrat

Reputation: 29690

I would use the ROWID:

UPDATE xyz SET x='Y' WHERE rowid IN (
    SELECT r FROM (
        SELECT ROWID r FROM xyz ORDER BY dbms_random.value
    ) RNDM WHERE rownum < n+1
)

The actual reason I would use ROWID isn't efficiency, though (it will still do a full table scan): your SQL may not update the number of rows you want if column m isn't unique.

With only 1000 rows, you shouldn't really be worried about efficiency (maybe with a hundred million rows). Without any index on this table, you're stuck doing a full table scan to select random records.

[EDIT:] "But what if there are 100,000 rows"

Well, that's still 3 orders of magnitude less than 100 million.

I ran the following:

create table xyz as select * from all_objects;

[created about 50,000 rows on my system - non-indexed, just like your table]

UPDATE xyz SET owner='Y' WHERE rowid IN (
     SELECT r FROM (
          SELECT ROWID r FROM xyz ORDER BY dbms_random.value
     ) RNDM WHERE rownum < 10000
);
commit;

This took approximately 1.5 seconds. Maybe it was 1 second, maybe up to 3 seconds (didn't formally time it, it just took about enough time to blink).

Upvotes: 13
