Reputation: 45
I’m working with some newsletter data and I’m trying to show the number of users that are signed up for more than one newsletter (overlap of users between lists). I’m using newsletter data in Google BigQuery and DataStudio to visualize.
My dataset returns one row for each user + newsletter combination. So if a user is signed up for three different newsletters it will show:
+---+------------+--------------+
| | Name | Newsletter |
+---+------------+--------------+
| 1 | User A | Newsletter 1 |
| 2 | User A | Newsletter 2 |
| 3 | User A | Newsletter 3 |
+---+------------+--------------+
I'm limiting the overlap combinations input to 2.
MY QUESTION IS: How would I query my original data set to return the count of overlapping users for all the possible combinations? I'm sure there's some way to do it with a variety of CASE statements, but that seems tedious and inefficient. Wondering if there might be an easier way that I'm not thinking of.
More background on how I'm thinking of visualizing, which would impact result:
Because I'm limiting each combination to 2 newsletters, I thought that a heatmap may be a good way to try to show this data, with the number of users on both lists in each “intersection.”
But in order to do this in DataStudio, I would need the data to appear in a format like this:
It’s like the result of a CROSS JOIN, where all the different combinations are represented in the column A & column B. So there is overlap, but this is the necessary setup to visualize it (at least that's the only way I think it's doable).
So if I'm going to use this specific method, how would I query my data set to return it in this format?
Also open to other ideas about how to think about / visualize this particular situation, but wanted to pose my specific question.
Upvotes: 1
Views: 385
Reputation: 173046
Looks like you are looking for something as below
#standardSQL
SELECT A, B, IFNULL(`Count`, 0) AS `Count`
FROM (
SELECT DISTINCT t1.Newsletter AS A, t2.Newsletter AS B
FROM `project.dataset.table` t1, `project.dataset.table` t2
) LEFT JOIN (
SELECT t1.Newsletter AS A, t2.Newsletter AS B, COUNT(1) AS `Count`
FROM `project.dataset.table` t1, `project.dataset.table` t2
WHERE t1.Name = t2.Name AND t1.Newsletter != t2.Newsletter
GROUP BY A, B
) USING (A, B)
Upvotes: 1