lord.fist
lord.fist

Reputation: 435

Bigquery overlapping value distribution query

I would like to know if it is possible (and how) to get minimum distinct value overlap in range of values.

For instance lets say I have 3 bags of XY values in them, I would like a single (this is important, I know how to do it for each bag separately) query that gives me a percentage of exclusive values (in relation to all other bags) in each bag.

Here is an example

BAG | VALUE
1 | 100
1 | 102
1 | 100
2 | 100
2 | 101
2 | 101
3 | 103
3 | 103
3 | 102
3 | 104

So what I would get here is:

BAG | MINIMUM EXCLUSIVE VALUES
1 | 0  (no items here are exclusive)
2 | 0.5 (only item 101 is exclusive in this bag and since distinct count of all items in this bag is 2, 50% of the bag is exclusive)
3 | 0.666666 (items 103 and 104 are exclusive to this bag and since distinct count of all items in the bag is 3 this gives 66,66666% of exclusive items)

Is there any way to make this with single bigquery query (single as in I don't need to rewrite this query for each bag in a set since there can be quite a big number of bags). Of course query can have subqueries but it should not be tied (hardcoded) for each bag.

Upvotes: 2

Views: 438

Answers (4)

lord.fist
lord.fist

Reputation: 435

select max_bag as  BAG,
                sum(exclusive) / MAX(b.valuesPerBag) * 100 as percentValue,
                MAX(b.valuesPerBag) as countValues
from (
                select VALUE,
                               max( BAG) as max_bag,
                               if(min( BAG) == max( BAG), 1, 0) as exclusive
                from {TABLE_LIST}
                group by VALUE
) a
left join (
  select  BAG, integer(count( distinct VALUE)) as valuesPerBag
  from {TABLE_LIST}
  group by  BAG
) b on a.max_bag = b.BAG
group by  BAG
order by  BAG desc

Ok so this seems to be a viable solution based on upper ones. And it is quite optimized as well. I haven't done all tests yet so I won't put it as accepted answer but from a few tests we made this seems correct.

EDIT: Done some testing, this was the fastest solution on 8mil items and 500 bags.

Upvotes: 0

Radek Michna
Radek Michna

Reputation: 509

try this:

select max_bag, sum(exclusive) / count(*) from
(select value, max(bag) as max_bag, if(min(bag) == max(bag), 1, 0) as exclusive
from 
  (Select 0 as BAG, 0 AS VALUE),
  (Select 1 as BAG, 100 AS VALUE),
  (Select 1 as BAG, 102 AS VALUE),
  (Select 1 as BAG, 100 AS VALUE),
  (Select 2 as BAG, 100 AS VALUE),
  (Select 2 as BAG, 101 AS VALUE),
  (Select 2 as BAG, 101 AS VALUE),
  (Select 3 as BAG, 103 AS VALUE),
  (Select 3 as BAG, 103 AS VALUE),
  (Select 3 as BAG, 102 AS VALUE),
  (Select 3 as BAG, 104 AS VALUE)
group by value
)
group by max_bag

the only problem is that this will omit results where the distribution is equal to zero (bag 1 in this case). I hope this should be able to process your data in seconds (you might need use group each by).

EDIT:

select bag, sum(exclusive) / max(dist_in_bag) from
(select bag, value, dist_in_bag, if(x == y, 1, 0) as exclusive from
(select bag, value, dist_in_bag, min(bag) over(partition by value) as x, max(bag) over(partition by value) as y from
(select bag, value, count(*) over(partition by bag) as dist_in_bag from
(Select 0 as BAG, 0 AS VALUE), 
(Select 1 as BAG, 100 AS VALUE), 
(Select 1 as BAG, 102 AS VALUE), 
(Select 1 as BAG, 100 AS VALUE), 
(Select 2 as BAG, 100 AS VALUE), 
(Select 2 as BAG, 101 AS VALUE), 
(Select 2 as BAG, 101 AS VALUE), 
(Select 3 as BAG, 108 AS VALUE), 
(Select 3 as BAG, 101 AS VALUE), 
(Select 3 as BAG, 101 AS VALUE), 
(Select 0 as BAG, 103 AS VALUE), 
(Select 0 as BAG, 103 AS VALUE), 
(Select 0 as BAG, 102 AS VALUE), 
(Select 0 as BAG, 104 AS VALUE), 
(Select 4 as BAG, 203 AS VALUE), 
(Select 4 as BAG, 203 AS VALUE), 
(Select 4 as BAG, 202 AS VALUE), 
(Select 4 as BAG, 204 AS VALUE)
group by bag, value
)))
group by bag

Upvotes: 0

Rohit
Rohit

Reputation: 3157

Building on top of @N.N.'s solution:

SELECT BAG, SUM(is_unique)/MAX(CVB) as MINIMUM_EXCLUSIVE_VALUES
FROM
    (
    SELECT BAG, VALUE, MAX(IF(CBV=1,1,0)) as is_unique , MAX(CVB) as CVB
    FROM
        (
        SELECT BAG,
                VALUE,
                Count(Distinct BAG) OVER(Partition BY VALUE) as CBV, 
                Count(Distinct VALUE) OVER(Partition BY BAG) as CVB
         FROM 
          (Select 1 as BAG, 100 AS VALUE),
          (Select 1 as BAG, 102 AS VALUE),
          (Select 1 as BAG, 100 AS VALUE),
          (Select 2 as BAG, 100 AS VALUE),
          (Select 2 as BAG, 101 AS VALUE),
          (Select 2 as BAG, 101 AS VALUE),
          (Select 3 as BAG, 103 AS VALUE),
          (Select 3 as BAG, 103 AS VALUE),
          (Select 3 as BAG, 102 AS VALUE),
          (Select 3 as BAG, 104 AS VALUE),
         ORDER BY BAG
         )
    GROUP BY BAG, VALUE
)
GROUP BY BAG

Upvotes: 3

N.N.
N.N.

Reputation: 3172

Select BAG,VALUE, IF(CNT_BagsPerValue>1,0,CNT/CNT_ValuesPerBag) as MIN_EXCLUSIVE_VALUES
FROM
(Select BAG,VALUE,CNT_BagsPerValue,CNT_ValuesPerBag,Count(*) as CNT
FROM
(
Select BAG,VALUE,--Count(*) as CNT,
Count(Distinct BAG) OVER(Partition BY VALUE) as CNT_BagsPerValue
, Count(Distinct VALUE) OVER(Partition BY BAG) as CNT_ValuesPerBag
from 
(Select 1 as BAG, 100 AS VALUE),
(Select 1 as BAG, 102 AS VALUE),
(Select 1 as BAG, 100 AS VALUE),
(Select 2 as BAG, 100 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 3 as BAG, 103 AS VALUE),
(Select 3 as BAG, 103 AS VALUE),
(Select 3 as BAG, 102 AS VALUE),
(Select 3 as BAG, 104 AS VALUE),
)
GROUP BY BAG,VALUE,CNT_BagsPerValue,CNT_ValuesPerBag,)

Upvotes: 1

Related Questions