Shtutz
Shtutz

Reputation: 1

How to split data in ratio using SQL

Suppose we have a table with columns X and Y where Y is the total count of values present in X.

Column X Column Y
3000 23
8000 50
4000 20
9000 70
5000 64

How to split the data with 8:1:1 ratio of column Y.

Example: Y is 23. Therefore 8:1:1 of Y will be nearly 18,2,3. There will be 18 rows for train, 3 rows for test and 2 rows for val.

Similarly 8:1:1 of 64 will be 51,7 and 6.

Expected output table is like this:

Column X Column Y Column Z
3000 1 Train
3000 . Train
3000 . Train
3000 18 Train
3000 1 Test
3000 . Test
3000 3 Test
3000 1 Val
3000 2 Val
8000 1 Train
8000 . Train
8000 40 Train
8000 1 Test
8000 . Test
8000 5 Test
8000 1 Val
8000 . Val
8000 5 Val
4000 1 Train
4000 . Train
4000 . Train
4000 16 Train
4000 1 Test
4000 2 Test
4000 1 Val
4000 2 Val
5000 1 Train
5000 . Train
5000 51 Train
5000 1 Test
5000 . Test
5000 . Test
5000 7 Test
5000 1 Val
5000 . Val
5000 6 Val

To summarize, I want to split the all rows in train, test, val set in proportion of 8:1:1 using column Y value.

I tried using Pandas for similar task, but unable to do it in SQL

Upvotes: 0

Views: 137

Answers (1)

dougp
dougp

Reputation: 3089

Here's a brute force method that will work. I had to make an assumption on what you mean by 8:1:1, but result in integers:

Val8 = FLOOR(ColumnY * 0.8)
Val1a = FLOOR(ColumnY * 0.1)
Val1b = ColumnY - Val8 - Val1a

So, you may need to adjust if you have clearer requirements.

The code below is written for SQL Server. Among other possible differences, other RDBMSs (like Oracle) may require WITH RECURSIVE for the recursive CTE to process without error.

with a as (
  select *
  from (
    values (1, 3000, 23)
         , (2, 8000, 50)
         , (3, 4000, 20)
         , (4, 9000, 70)
         , (5, 5000, 64)
  ) t (id, ColumnX, ColumnY)
),
b as (
  select id
  , 'Train' as ColumnZ
  , ColumnX
  , cast(ColumnY * 0.8 as int) as ColumnY
  from a
  union all
  select id
  , 'Test' as ColumnZ
  , ColumnX
  , cast(ColumnY * 0.1 as int) as ColumnY
  from a
),
c as (
  select b.id
  , 'Val' as ColumnZ
  , b.ColumnX
  , a.ColumnY - sum(b.ColumnY) as ColumnY
  from b
    inner join a on a.ColumnX = b.ColumnX
  group by b.id
  , b.ColumnX
  , a.ColumnY
),
iterator as (
  select 1 as ColumnY
  , max(ColumnY) as m
  from c
  union all
  select ColumnY + 1
  , m
  from iterator
  where ColumnY <= m
)
select b.ColumnX
, i.ColumnY
, b.ColumnZ
from b
  inner join iterator i on i.ColumnY <= b.ColumnY
union all
select c.ColumnX
, i.ColumnY
, c.ColumnZ
from c
  inner join iterator i on i.ColumnY <= c.ColumnY
order by 1, 3, 2

Upvotes: 0

Related Questions