Reputation: 1
Suppose we have a table with columns X and Y where Y is the total count of values present in X.
Column X | Column Y |
---|---|
3000 | 23 |
8000 | 50 |
4000 | 20 |
9000 | 70 |
5000 | 64 |
How to split the data with 8:1:1 ratio of column Y.
Example: Y is 23. Therefore 8:1:1 of Y will be nearly 18,2,3. There will be 18 rows for train, 3 rows for test and 2 rows for val.
Similarly 8:1:1 of 64 will be 51,7 and 6.
Expected output table is like this:
Column X | Column Y | Column Z |
---|---|---|
3000 | 1 | Train |
3000 | . | Train |
3000 | . | Train |
3000 | 18 | Train |
3000 | 1 | Test |
3000 | . | Test |
3000 | 3 | Test |
3000 | 1 | Val |
3000 | 2 | Val |
8000 | 1 | Train |
8000 | . | Train |
8000 | 40 | Train |
8000 | 1 | Test |
8000 | . | Test |
8000 | 5 | Test |
8000 | 1 | Val |
8000 | . | Val |
8000 | 5 | Val |
4000 | 1 | Train |
4000 | . | Train |
4000 | . | Train |
4000 | 16 | Train |
4000 | 1 | Test |
4000 | 2 | Test |
4000 | 1 | Val |
4000 | 2 | Val |
5000 | 1 | Train |
5000 | . | Train |
5000 | 51 | Train |
5000 | 1 | Test |
5000 | . | Test |
5000 | . | Test |
5000 | 7 | Test |
5000 | 1 | Val |
5000 | . | Val |
5000 | 6 | Val |
To summarize, I want to split the all rows in train, test, val set in proportion of 8:1:1 using column Y value.
I tried using Pandas for similar task, but unable to do it in SQL
Upvotes: 0
Views: 137
Reputation: 3089
Here's a brute force method that will work. I had to make an assumption on what you mean by 8:1:1, but result in integers:
Val8 = FLOOR(ColumnY * 0.8)
Val1a = FLOOR(ColumnY * 0.1)
Val1b = ColumnY - Val8 - Val1a
So, you may need to adjust if you have clearer requirements.
The code below is written for SQL Server. Among other possible differences, other RDBMSs (like Oracle) may require WITH RECURSIVE
for the recursive CTE to process without error.
with a as (
select *
from (
values (1, 3000, 23)
, (2, 8000, 50)
, (3, 4000, 20)
, (4, 9000, 70)
, (5, 5000, 64)
) t (id, ColumnX, ColumnY)
),
b as (
select id
, 'Train' as ColumnZ
, ColumnX
, cast(ColumnY * 0.8 as int) as ColumnY
from a
union all
select id
, 'Test' as ColumnZ
, ColumnX
, cast(ColumnY * 0.1 as int) as ColumnY
from a
),
c as (
select b.id
, 'Val' as ColumnZ
, b.ColumnX
, a.ColumnY - sum(b.ColumnY) as ColumnY
from b
inner join a on a.ColumnX = b.ColumnX
group by b.id
, b.ColumnX
, a.ColumnY
),
iterator as (
select 1 as ColumnY
, max(ColumnY) as m
from c
union all
select ColumnY + 1
, m
from iterator
where ColumnY <= m
)
select b.ColumnX
, i.ColumnY
, b.ColumnZ
from b
inner join iterator i on i.ColumnY <= b.ColumnY
union all
select c.ColumnX
, i.ColumnY
, c.ColumnZ
from c
inner join iterator i on i.ColumnY <= c.ColumnY
order by 1, 3, 2
Upvotes: 0