Reputation: 8485
I have the following schema:
CREATE TABLE video_segments (
video_id TEXT,
segment_num INTEGER,
data BLOB
)
I want to select 100 pairs of rows from the table, such that each pair contains 2 segments from the same video, and no 2 pairs are from the same video.
What would be the best query for this?
I tried doing:
WITH CTE AS (
SELECT video_id
FROM video_segments
GROUP BY video_id
HAVING COUNT(*) >= 2
)
SELECT *
FROM video_segments
WHERE video_id IN (
SELECT video_id
FROM CTE
ORDER BY RANDOM()
LIMIT 100
)
ORDER BY RANDOM()
LIMIT 200;
but this did not work because sometimes there would be 0 rows with a given video id, and sometimes there would be more than 2 rows with a video id.
Sampled data:
video_id|segment_num|data|
foo 0 <bin>
foo 1 <bin>
foo 2 <bin>
foo 3 <bin>
bar 0 <bin>
bar 1 <bin>
baz 0 <bin>
baz 1 <bin>
baz 2 <bin>
Lets say I wanted to select 3 random pairs. A valid result might be:
foo 0 <bin>
foo 2 <bin>
bar 0 <bin>
bar 1 <bin>
baz 0 <bin>
baz 2 <bin>
since results should be random. The result should be different each time.
Upvotes: 0
Views: 135
Reputation: 164184
Use window function ROW_NUMBER()
with random ordering:
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY video_id ORDER BY RANDOM()) rn,
COUNT(*) OVER (PARTITION BY video_id) cnt
FROM video_segments
)
SELECT video_id, segment_num, data
FROM cte
WHERE cnt >= 2 AND rn <= 2
ORDER BY video_id;
See the demo.
Upvotes: 0
Reputation: 4694
Something like this should do:
Note: The first recursive CTE term is only to create test data for segments.
WITH segments (sid, vid) AS (
SELECT 0 , 1 UNION ALL
SELECT sid+1, 1 + (sid+1)/10 FROM segments WHERE sid < 100
)
, cte1 AS (
SELECT t.*, row_number() OVER (PARTITION BY vid ORDER BY random()) AS ord
FROM segments AS t
WHERE vid <= 10
)
SELECT * FROM cte1
WHERE ord < 3
ORDER BY vid, ord
;
Now applying that to your schema and initial logic, we have something like this:
WITH cte AS (
SELECT video_id
FROM video_segments
GROUP BY video_id
HAVING COUNT(*) >= 2
)
, cte1 AS (
SELECT *, row_number() OVER (PARTITION BY video_id ORDER BY random()) AS ord
FROM video_segments
WHERE video_id IN (
SELECT video_id
FROM cte
ORDER BY RANDOM()
LIMIT 100
)
)
SELECT * FROM cte1
WHERE ord < 3
ORDER BY video_id, ord
;
Fiddle updated with the new data
Result 1:
video_id | segment_num | data | ord |
---|---|---|---|
bar | 1 | null | 1 |
bar | 0 | null | 2 |
baz | 0 | null | 1 |
baz | 2 | null | 2 |
foo | 1 | null | 1 |
foo | 0 | null | 2 |
Result 2:
video_id | segment_num | data | ord |
---|---|---|---|
bar | 1 | null | 1 |
bar | 0 | null | 2 |
baz | 1 | null | 1 |
baz | 0 | null | 2 |
foo | 2 | null | 1 |
foo | 3 | null | 2 |
etc.
The larger set of data generated in the first fiddle test case shows the random behavior a bit better.
Upvotes: 2