tefran
tefran

Reputation: 45

Google BigQuery: How to query the count of shared values between two different values?

I’m working with some newsletter data and I’m trying to show the number of users that are signed up for more than one newsletter (overlap of users between lists). I’m using newsletter data in Google BigQuery and DataStudio to visualize.

My dataset returns one row for each user + newsletter combination. So if a user is signed up for three different newsletters it will show:

+---+------------+--------------+
|   | Name       |  Newsletter  |
+---+------------+--------------+
| 1 | User A     | Newsletter 1 |
| 2 | User A     | Newsletter 2 |
| 3 | User A     | Newsletter 3 |
+---+------------+--------------+

I'm limiting the overlap combinations input to 2.

MY QUESTION IS: How would I query my original data set to return the count of overlapping users for all the possible combinations? I'm sure there's some way to do it with a variety of CASE statements, but that seems tedious and inefficient. Wondering if there might be an easier way that I'm not thinking of.

More background on how I'm thinking of visualizing, which would impact result:

Because I'm limiting each combination to 2 newsletters, I thought that a heatmap may be a good way to try to show this data, with the number of users on both lists in each “intersection.” enter image description here

But in order to do this in DataStudio, I would need the data to appear in a format like this: enter image description here

It’s like the result of a CROSS JOIN, where all the different combinations are represented in the column A & column B. So there is overlap, but this is the necessary setup to visualize it (at least that's the only way I think it's doable).

So if I'm going to use this specific method, how would I query my data set to return it in this format?

Also open to other ideas about how to think about / visualize this particular situation, but wanted to pose my specific question.

Upvotes: 1

Views: 385

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

Looks like you are looking for something as below

#standardSQL
SELECT A, B, IFNULL(`Count`, 0) AS `Count`
FROM (
  SELECT DISTINCT t1.Newsletter AS A, t2.Newsletter AS B 
  FROM `project.dataset.table` t1, `project.dataset.table` t2
) LEFT JOIN (
  SELECT t1.Newsletter AS A, t2.Newsletter AS B, COUNT(1) AS `Count` 
  FROM `project.dataset.table` t1, `project.dataset.table` t2
  WHERE t1.Name = t2.Name AND t1.Newsletter != t2.Newsletter
  GROUP BY A, B
) USING (A, B)

Upvotes: 1

Related Questions