Create table from loop output Oracle SQL

Question

I need to pull a random sample from a table of ~5 million observations based on 175 demographic options. The demographic table is something like this form:

Basically I need this same demographic breakdown randomly sampled from the 5M row table. For each demographic I need a sample of the same one from the larger table but with 5x the number of observations (example: for demographic 1 I want a random sample of 200).

SELECT  *
FROM    (
        SELECT  *
        FROM    my_table
        ORDER BY
                dbms_random.value
        )
WHERE rownum <= 100;

I've used this syntax before to get a random sample but is there any way I can modify this as a loop and substitute variable names from existing tables? I'll try to encapsulate the logic I need in pseudocode:

for (each demographic_COLUMN in TABLE1) 
    select random(5*num_obs_COLUMN in TABLE1) from ID_COLUMN in TABLE2
/*somehow join the results of each step in the loop into one giant column of IDs */

Alex Poole · Accepted Answer

You could join your tables (assuming the 1-175 demographic value exists in both, or there is an equivalent column to join on), something like:

select id
from (
  select d.demographic, d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage

Each row in the main table is given a random pseudo-row-number within its demographic (via the analytic row_number()). The outer query then uses the relevant percentage to select how many of those randomly-ordered rows for each demographic to return.

I'm not sure I've understood how you're actually picking exactly how many of each you want, so that probably needs to be adjusted.
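As one possible adjustment, here is a minimal sketch that assumes the demographics table carries a per-demographic observation count rather than a percentage. The column name num_obs is hypothetical, borrowed from the question's pseudocode, and the two CTEs are stand-in sample data only:

```sql
-- Stand-in sample data; replace these CTEs with the real tables.
with my_table (id, demographic) as (
  select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
-- num_obs is an assumed column name holding the count per demographic
demographics (demographic, num_obs) as (
            select 1, 40 from dual
  union all select 2, 30 from dual
)
select id
from (
  select d.num_obs, t.id,
    -- random ordering within each demographic
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
-- keep 5x the stated number of observations for each demographic
where rn <= 5 * num_obs;
```

If a demographic has fewer than 5 * num_obs rows in the large table, the filter simply returns every row that demographic has.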

Demo with a smaller sample generated in a CTE, and a correspondingly smaller filter condition:

-- CTEs for sample data
with my_table (id, demographic) as (
  select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
demographics (demographic, percentage, str) as (
            select 1, 40, '4%' from dual
  union all select 2, 30, '3%' from dual
  union all select 3, 30, '3%' from dual
  -- ...
  union all select 174, 2, '.02%' from dual
  union all select 175, 1, '.01%' from dual
)
-- actual query
select demographic, percentage, id, rn
from (
  select d.demographic, d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage;

DEMOGRAPHIC PERCENTAGE         ID         RN
----------- ---------- ---------- ----------
          1         40      94150          1
          1         40      36925          2
          1         40     154000          3
          1         40      82425          4
...
          1         40     154350        199
          1         40     126175        200
          2         30      36051          1
          2         30       1051          2
          2         30     100451          3
...
          2         30      18026        149
          2         30     151726        150
          3         30     125302          1
          3         30     152252          2
          3         30     114452          3
...
          3         30     104652        149
          3         30      70527        150
        174          2      35698          1
        174          2      67548          2
        174          2     114798          3
...
        174          2      70698          9
        174          2      30973         10
        175          1     139649          1
        175          1     156974          2
        175          1     145774          3
        175          1      97124          4
        175          1      40074          5

You only need the ID, but I've included the other columns for context. Or, more succinctly, as counts per demographic:

with my_table (id, demographic) as (
  select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
demographics (demographic, percentage, str) as (
            select 1, 40, '4%' from dual
  union all select 2, 30, '3%' from dual
  union all select 3, 30, '3%' from dual
  -- ...
  union all select 174, 2, '.02%' from dual
  union all select 175, 1, '.01%' from dual
)
select demographic, percentage, count(id) as ids, min(id) as min_id, max(id) as max_id
from (
  select d.demographic, d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage
group by demographic, percentage
order by demographic;

DEMOGRAPHIC PERCENTAGE        IDS     MIN_ID     MAX_ID
----------- ---------- ---------- ---------- ----------
          1         40        200        175     174825
          2         30        150          1     174126
          3         30        150       2452     174477
        174          2         10      23448     146648
        175          1          5      19074     118649
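Since the title asks for a table built from the result, the same query could be wrapped in a CREATE TABLE ... AS SELECT. This is only a sketch: sample_ids is a hypothetical table name, and the CTEs are the same stand-in sample data as in the demo, which you would drop when running against the real demographics and my_table tables.

```sql
-- sample_ids is a hypothetical name for the table holding the sampled IDs
create table sample_ids as
with my_table (id, demographic) as (
  -- stand-in sample data; omit this WITH clause when the real tables exist
  select level, mod(level, 175) + 1 from dual connect by level <= 175000
),
demographics (demographic, percentage) as (
            select 1, 40 from dual
  union all select 2, 30 from dual
)
select id
from (
  select d.percentage, t.id,
    row_number() over (partition by d.demographic order by dbms_random.value) as rn
  from demographics d
  join my_table t on t.demographic = d.demographic
)
where rn <= 5 * percentage;

-- quick check of how many IDs were stored
select count(*) from sample_ids;
```

Because the random ordering is evaluated once when the table is created, the stored sample stays fixed until the table is dropped and rebuilt.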

db<>fiddle
