Reputation: 497

Distribution of data in buckets - Oracle 11g

I have a table with two columns BRANCH and ACTIVITIES, where BRANCH is a unique id of location and ACTIVITIES are number of records belong to respective BRANCH. These records to be distributed in 5 buckets in a way that all buckets should contain almost equal number of records. (no matter if difference is +/-1000)

The challenge is if one branch is selected in a bucket then all activities of same branch will also be selected in same bucket, in other words, number of activities belong to one BRANCH cannot be split. Lets take a very simple example so that I can explain what I am trying to achieve

Total Branches=10
Total Number of activities (records) = 55,000
Average (total activities/total buckets) = 11,000

Sample Data

After Distribution

All buckets contain 11,000 records but things are not such straight forward when we look into real data real data

All Oracle query masters are requested to please look into this. Your expert opinion will highly be appreciated.

Upvotes: 1

Answers (2)

Mahamoutou

Reputation: 1555

You could also try below query. I added some stats columns in this inline view stats_cols_added_tab before I applied dense_rank analytic function to that inline view. Finally I used NTILE analytic function to get five groups.

with sample_data (branch, activities) as (
select 1, 1000 from dual union all
select 2, 2000 from dual union all
select 3, 3000 from dual union all
select 4, 4000 from dual union all
select 5, 5000 from dual union all
select 6, 6000 from dual union all
select 7, 7000 from dual union all
select 8, 8000 from dual union all
select 9, 9000 from dual union all
select 10, 10000 from dual
)
,
stats_cols_added_tab as (
    select s.*
    , count(*)over() total_branches
    , sum(activities)over() total_number_of_activities
    , avg(activities)over() * 2 Average
    , case when row_number()over(order by s.branch) <= count(*)over() / 2 then 1 else 2 end grp
from sample_data s
)
SELECT BRANCH, ACTIVITIES, NTILE(5) OVER (ORDER BY ranked_grp,  BRANCH) AS bucket  
FROM (
        select BRANCH, ACTIVITIES
            , dense_rank()over(
                PARTITION BY grp 
                    order by decode(grp, 1, activities, -1 * activities)
                    ) ranked_grp
        from stats_cols_added_tab t 
    ) t
order by ranked_grp, BRANCH
;

Upvotes: 0

Gordon Linoff

Reputation: 1270431

Unfortunately, this is a bin-packing problem and a "perfect" solution requires -- essentially -- searching through all possible assignments of buckets and then choosing the "best" one. And such an approach is not really suitable for SQL.

For a "good-enough" solution, though, something like a round-robin approach often works well. Simply enumerate the branches from biggest to smallest and assign them to buckets:

select a.branch,
       1 + mod(seqnum, 5) as bucket
from (select a.branch, count(*) as cnt,
             row_number() over (order by count(*)) desc as seqnum
      from activities a
      group by a.branch
     ) a;

Because of the ordering, this is going to generally create buckets of different sizes. So, a slight variation assigns the buckets as 1-2-3-4-5-5-4-3-2-1:

select a.branch,
       (case when mod(seqnum, 10) in (0, 9) then 1
             when mod(seqnum, 10) in (1, 8) then 2
             when mod(seqnum, 10) in (2, 7) then 3
             when mod(seqnum, 10) in (3, 6) then 4
             when mod(seqnum, 10) in (4, 5) then 5
        end) as bucket
from (select a.branch, count(*) as cnt,
             row_number() over (order by count(*)) desc as seqnum
      from activities a
      group by a.branch
     ) a;

Upvotes: 1

Distribution of data in buckets - Oracle 11g

Answers (2)

Related Questions