Reputation: 4448
I have the following population:
a
b
b
c
c
c
c
I am looking for a SQL statement to generate a stratified sample of arbitrary size. Let's say, for this example, I would like a sample size of 4. I would expect the output to be:
a
b
c
c
Upvotes: 2
Views: 112
Reputation: 26347
select *
from population
order by row_number() over (partition by stratum)
limit 4
offset 0;
stratum
-------
c
b
a
c
row_number() numbers the members within each stratum, ORDER BY sorts by that, LIMIT cuts off your sample, and OFFSET lets you progress through samples. You can use different pagination methods to progress through consecutive, non-overlapping samples of your population. LIMIT..OFFSET isn't the best, but it's the simplest.
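Progressing to the next non-overlapping sample of 4 is then just a matter of moving the offset (a sketch, assuming the same population table):

```sql
-- Second page of the stratified ordering: members 5-8,
-- none of which appeared in the first sample.
select stratum
from population
order by row_number() over (partition by stratum)
limit 4
offset 4;
```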
Once it has sampled one member from each group, it picks another member however Postgres finds it quickest. If you instead want to force it to pick them alphabetically (get b instead of c as the fourth member drafted into this sample), add another order by item accordingly, as shown in the demo.
To order the whole extracted sample afterwards, you can wrap the query in a subquery or a CTE and add another order by outside, so that it sorts the result without affecting how members are sampled.
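A sketch of that wrapping, assuming a table population(stratum) as in the question; the extra order by stratum inside is the alphabetical tiebreak:

```sql
-- Sample first, then sort the extracted sample without
-- affecting which members get drafted.
with sample as (
  select stratum
  from population
  order by row_number() over (partition by stratum), stratum
  limit 4
)
select stratum
from sample
order by stratum;  -- outer sort applies to the sample only
```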
There are also built-in random sampling methods you can specify with the tablesample clause:
select *
from population
tablesample system (50) repeatable (0.42)
limit 4;
But those don't operate on data-level strata.
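For row-level rather than page-level random sampling, a bernoulli variant might look like this (a sketch; the 50 percent and the seed are arbitrary):

```sql
-- Each row independently has a 50% chance of being drafted;
-- repeatable(0.42) fixes the seed so reruns return the same sample.
select *
from population
tablesample bernoulli (50) repeatable (0.42)
limit 4;
```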
TABLESAMPLE SYSTEM uses pages. 50 means every page of the table has a 50% chance of being drafted. The number of live records on a page isn't constant. This typically gets you neighbouring rows that got inserted together/consecutively. You need to know the total row count of the table and adjust that percentage to it in order to arrive at a specific sample size. You also still need a limit clause on top, because the exact sample size you get is based entirely on probability.

TABLESAMPLE BERNOULLI uses records. With 50, every record of every page has a 50% chance. Again, it needs to be combined with the total row count and trimmed with limit to arrive at a specific sample size.

TABLESAMPLE SYSTEM_TIME from tsm_system_time is TABLESAMPLE SYSTEM, but instead of accepting a target sample %, it takes a time limit. It just keeps drafting until it runs out of time.

TABLESAMPLE SYSTEM_ROWS from tsm_system_rows is like TABLESAMPLE SYSTEM with LIMIT applied during sampling: it drafts page by page until it collects the target sample size.

Upvotes: 2
Reputation: 109155
You can use the NTILE window function to define the number of buckets (or tiles) you want, then use ROW_NUMBER() to identify the first row of each tile, and filter on that:
select col
from (
  select col, tile, row_number() over (partition by tile order by col) as rownr
  from (
    select col, ntile(4) over (order by col) as tile
    from (values ('a'), ('b'), ('b'), ('c'), ('c'), ('c'), ('c')) as a(col)
  ) b
) c
where rownr = 1
See also this dbfiddle.
For this specific example, you can also use the MIN
aggregate function instead of ROW_NUMBER()
:
select min(col) as col
from (
  select col, ntile(4) over (order by col) as tile
  from (values ('a'), ('b'), ('b'), ('c'), ('c'), ('c'), ('c')) as a(col)
) b
group by tile
order by 1
See also this dbfiddle.
However, the first solution is in my opinion more generally useful.
Upvotes: 0
Reputation: 522506
We can use a count trick here, with the help of window functions:

WITH cte AS (
    SELECT t.*,
           COUNT(*) OVER (PARTITION BY col1) cnt,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) rn
    FROM yourTable t
)
SELECT col1
FROM cte
WHERE rn <= ROUND(cnt * 4.0 / (SELECT COUNT(*) FROM yourTable))
ORDER BY col1;
The idea is to sequentially number the rows within each group, then keep from each group a share proportional to its size: roughly cnt * 4 / total rows per group, rounded to the nearest integer.
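One way to check the approach against the question's data is to inline it as a VALUES list; rounding each group's quota (a hypothetical tweak) makes this example land on exactly 4 rows:

```sql
-- Population: 1×a, 2×b, 4×c; target sample size 4.
-- Quota per group is its size times 4/7, rounded: a→1, b→1, c→2.
WITH data(col1) AS (
    VALUES ('a'), ('b'), ('b'), ('c'), ('c'), ('c'), ('c')
), cte AS (
    SELECT col1,
           COUNT(*) OVER (PARTITION BY col1) cnt,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) rn
    FROM data
)
SELECT col1
FROM cte
WHERE rn <= ROUND(cnt * 4.0 / (SELECT COUNT(*) FROM data))
ORDER BY col1;
-- a, b, c, c
```

Note that with other population shapes the rounded quotas may not sum to exactly the requested size; a trailing limit can trim any excess.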
Upvotes: 1