Foobar
Foobar

Reputation: 8485

How to select pairs of rows from a table in sqlite?

I have the following schema:

    CREATE TABLE video_segments (
        video_id TEXT,
        segment_num INTEGER,
        data BLOB
    )

I want to select 100 pairs of rows from the table, such that each pair contains 2 segments from the same video, and no 2 pairs are from the same video.

What would be the best query for this?

I tried doing:

WITH CTE AS (
  SELECT video_id
  FROM video_segments
  GROUP BY video_id
  HAVING COUNT(*) >= 2
)
SELECT *
FROM video_segments
WHERE video_id IN (
  SELECT video_id
  FROM CTE
  ORDER BY RANDOM()
  LIMIT 100
)
ORDER BY RANDOM()
LIMIT 200;

but this did not work because sometimes there would be 0 rows with a given video id, and sometimes there would be more than 2 rows with a video id.

Sampled data:

video_id|segment_num|data|
foo      0            <bin>
foo      1            <bin>
foo      2            <bin>
foo      3            <bin>
bar      0            <bin>
bar      1            <bin>
baz      0            <bin>
baz      1            <bin>
baz      2            <bin>

Lets say I wanted to select 3 random pairs. A valid result might be:

foo 0 <bin>
foo 2 <bin>
bar 0 <bin>
bar 1 <bin>
baz 0 <bin>
baz 2 <bin>

since results should be random. The result should be different each time.

Upvotes: 0

Views: 135

Answers (2)

forpas
forpas

Reputation: 164184

Use window function ROW_NUMBER() with random ordering:

WITH cte AS (
  SELECT *, 
         ROW_NUMBER() OVER (PARTITION BY video_id ORDER BY RANDOM()) rn,
         COUNT(*) OVER (PARTITION BY video_id) cnt
  FROM video_segments
)  
SELECT video_id, segment_num, data  
FROM cte
WHERE cnt >= 2 AND rn <= 2
ORDER BY video_id;

See the demo.

Upvotes: 0

Jon Armstrong
Jon Armstrong

Reputation: 4694

Something like this should do:

The fiddle

  1. Assume vid <= 10 are the videos chosen and
  2. We wish to select 2 sid's per selected vid.
  3. Randomly assign a row number to each sid within each vid separately.
  4. Then pick the first 2 row numbers (ord < 3) for each of those vids

Note: The first recursive CTE term is only to create test data for segments.

WITH segments (sid, vid) AS (
        SELECT 0  , 1 UNION ALL
        SELECT sid+1, 1 + (sid+1)/10 FROM segments WHERE sid < 100
     )
   , cte1 AS (
        SELECT t.*, row_number() OVER (PARTITION BY vid ORDER BY random()) AS ord
          FROM segments AS t
         WHERE vid <= 10
     )
SELECT * FROM cte1
 WHERE ord < 3
 ORDER BY vid, ord
;

Now applying that to your schema and initial logic, we have something like this:

The updated fiddle

WITH cte AS (
      SELECT video_id
        FROM video_segments
       GROUP BY video_id
      HAVING COUNT(*) >= 2
     )
   , cte1 AS (
      SELECT *, row_number() OVER (PARTITION BY video_id ORDER BY random()) AS ord
        FROM video_segments
       WHERE video_id IN (
                SELECT video_id
                  FROM cte
                 ORDER BY RANDOM()
                 LIMIT 100
             )
     )
SELECT * FROM cte1
 WHERE ord < 3
 ORDER BY video_id, ord
;

Fiddle updated with the new data

Result 1:

video_id segment_num data ord
bar 1 null 1
bar 0 null 2
baz 0 null 1
baz 2 null 2
foo 1 null 1
foo 0 null 2

Result 2:

video_id segment_num data ord
bar 1 null 1
bar 0 null 2
baz 1 null 1
baz 0 null 2
foo 2 null 1
foo 3 null 2

etc.

The larger set of data generated in the first fiddle test case shows the random behavior a bit better.

Upvotes: 2

Related Questions