Reputation: 435
I would like to know if it is possible (and how) to get minimum distinct value overlap in range of values.
For instance lets say I have 3 bags of XY values in them, I would like a single (this is important, I know how to do it for each bag separately) query that gives me a percentage of exclusive values (in relation to all other bags) in each bag.
Here is an example
BAG | VALUE
1 | 100
1 | 102
1 | 100
2 | 100
2 | 101
2 | 101
3 | 103
3 | 103
3 | 102
3 | 104
So what I would get here is:
BAG | MINIMUM EXCLUSIVE VALUES
1 | 0 (no items here are exclusive)
2 | 0.5 (only item 101 is exclusive in this bag and since distinct count of all items in this bag is 2, 50% of the bag is exclusive)
3 | 0.666666 (items 103 and 104 are exclusive to this bag and since distinct count of all items in the bag is 3 this gives 66,66666% of exclusive items)
Is there any way to make this with single bigquery query (single as in I don't need to rewrite this query for each bag in a set since there can be quite a big number of bags). Of course query can have subqueries but it should not be tied (hardcoded) for each bag.
Upvotes: 2
Views: 438
Reputation: 435
select max_bag as BAG,
sum(exclusive) / MAX(b.valuesPerBag) * 100 as percentValue,
MAX(b.valuesPerBag) as countValues
from (
select VALUE,
max( BAG) as max_bag,
if(min( BAG) == max( BAG), 1, 0) as exclusive
from {TABLE_LIST}
group by VALUE
) a
left join (
select BAG, integer(count( distinct VALUE)) as valuesPerBag
from {TABLE_LIST}
group by BAG
) b on a.max_bag = b.BAG
group by BAG
order by BAG desc
Ok so this seems to be a viable solution based on upper ones. And it is quite optimized as well. I haven't done all tests yet so I won't put it as accepted answer but from a few tests we made this seems correct.
EDIT: Done some testing, this was the fastest solution on 8mil items and 500 bags.
Upvotes: 0
Reputation: 509
try this:
select max_bag, sum(exclusive) / count(*) from
(select value, max(bag) as max_bag, if(min(bag) == max(bag), 1, 0) as exclusive
from
(Select 0 as BAG, 0 AS VALUE),
(Select 1 as BAG, 100 AS VALUE),
(Select 1 as BAG, 102 AS VALUE),
(Select 1 as BAG, 100 AS VALUE),
(Select 2 as BAG, 100 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 3 as BAG, 103 AS VALUE),
(Select 3 as BAG, 103 AS VALUE),
(Select 3 as BAG, 102 AS VALUE),
(Select 3 as BAG, 104 AS VALUE)
group by value
)
group by max_bag
the only problem is that this will omit results where the distribution is equal to zero (bag 1 in this case). I hope this should be able to process your data in seconds (you might need use group each by).
EDIT:
select bag, sum(exclusive) / max(dist_in_bag) from
(select bag, value, dist_in_bag, if(x == y, 1, 0) as exclusive from
(select bag, value, dist_in_bag, min(bag) over(partition by value) as x, max(bag) over(partition by value) as y from
(select bag, value, count(*) over(partition by bag) as dist_in_bag from
(Select 0 as BAG, 0 AS VALUE),
(Select 1 as BAG, 100 AS VALUE),
(Select 1 as BAG, 102 AS VALUE),
(Select 1 as BAG, 100 AS VALUE),
(Select 2 as BAG, 100 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 3 as BAG, 108 AS VALUE),
(Select 3 as BAG, 101 AS VALUE),
(Select 3 as BAG, 101 AS VALUE),
(Select 0 as BAG, 103 AS VALUE),
(Select 0 as BAG, 103 AS VALUE),
(Select 0 as BAG, 102 AS VALUE),
(Select 0 as BAG, 104 AS VALUE),
(Select 4 as BAG, 203 AS VALUE),
(Select 4 as BAG, 203 AS VALUE),
(Select 4 as BAG, 202 AS VALUE),
(Select 4 as BAG, 204 AS VALUE)
group by bag, value
)))
group by bag
Upvotes: 0
Reputation: 3157
Building on top of @N.N.'s solution:
SELECT BAG, SUM(is_unique)/MAX(CVB) as MINIMUM_EXCLUSIVE_VALUES
FROM
(
SELECT BAG, VALUE, MAX(IF(CBV=1,1,0)) as is_unique , MAX(CVB) as CVB
FROM
(
SELECT BAG,
VALUE,
Count(Distinct BAG) OVER(Partition BY VALUE) as CBV,
Count(Distinct VALUE) OVER(Partition BY BAG) as CVB
FROM
(Select 1 as BAG, 100 AS VALUE),
(Select 1 as BAG, 102 AS VALUE),
(Select 1 as BAG, 100 AS VALUE),
(Select 2 as BAG, 100 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 3 as BAG, 103 AS VALUE),
(Select 3 as BAG, 103 AS VALUE),
(Select 3 as BAG, 102 AS VALUE),
(Select 3 as BAG, 104 AS VALUE),
ORDER BY BAG
)
GROUP BY BAG, VALUE
)
GROUP BY BAG
Upvotes: 3
Reputation: 3172
Select BAG,VALUE, IF(CNT_BagsPerValue>1,0,CNT/CNT_ValuesPerBag) as MIN_EXCLUSIVE_VALUES
FROM
(Select BAG,VALUE,CNT_BagsPerValue,CNT_ValuesPerBag,Count(*) as CNT
FROM
(
Select BAG,VALUE,--Count(*) as CNT,
Count(Distinct BAG) OVER(Partition BY VALUE) as CNT_BagsPerValue
, Count(Distinct VALUE) OVER(Partition BY BAG) as CNT_ValuesPerBag
from
(Select 1 as BAG, 100 AS VALUE),
(Select 1 as BAG, 102 AS VALUE),
(Select 1 as BAG, 100 AS VALUE),
(Select 2 as BAG, 100 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 2 as BAG, 101 AS VALUE),
(Select 3 as BAG, 103 AS VALUE),
(Select 3 as BAG, 103 AS VALUE),
(Select 3 as BAG, 102 AS VALUE),
(Select 3 as BAG, 104 AS VALUE),
)
GROUP BY BAG,VALUE,CNT_BagsPerValue,CNT_ValuesPerBag,)
Upvotes: 1