Count many distinct combinations of columns with multiple conditions

Question

I have a table with a running history of customer interactions named fact_interactions. Every time a customer is contacted, a new record is created with specific details about the interaction. Here is an example:

inter_id |customer_id |business_id |department_id |datetime_local      |outcome_id |
---------|------------|------------|--------------|--------------------|-----------|
46032383 |1           |112         |1916          |2015-01-14 19:54:20 |48         |
55740863 |2           |2           |3358          |2015-05-06 12:02:12 |19         |
49512895 |3           |160         |396           |2015-01-22 11:57:17 |19         |
51822751 |3           |160         |396           |2015-01-28 13:46:19 |19         |
23533190 |4           |132         |425           |2015-03-26 12:42:24 |19         |
69354240 |5           |164         |3061          |2015-03-30 11:01:43 |19         |
61417848 |5           |164         |3061          |2015-04-01 14:36:30 |19         |
74948424 |5           |164         |3061          |2015-04-28 15:12:42 |19         |
75303296 |5           |164         |3061          |2015-04-29 13:51:02 |10         |
76071776 |5           |164         |3061          |2015-05-01 09:18:39 |10         |

For each record, I need to find all the rows that match by multiple conditions, across multiple time windows. Here's a sample of my query with a few of the different subqueries I'm currently using:

SELECT
    inter_id,
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
      AND   B.business_id = A.business_id
      AND   B.department_id = A.department_id
      AND   B.datetime_local::date = A.datetime_local::date
      AND   B.datetime_local < A.datetime_local) AS cnt_samesamesame_day0
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
      AND   B.business_id = A.business_id
      AND   B.department_id <> A.department_id
      AND   B.datetime_local::date = A.datetime_local::date
      AND   B.datetime_local < A.datetime_local) AS cnt_samesamediff_day0
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
      AND   B.business_id <> A.business_id
      AND   B.department_id <> A.department_id
      AND   B.datetime_local::date = A.datetime_local::date
      AND   B.datetime_local < A.datetime_local) AS cnt_samediffdiff_day0
FROM fact_interactions A;

In total I have 180 subqueries for the counts I'm trying to calculate. So if fact_interaction has 1,000,000 records the output would also have 1,000,000 records, but would have the inter_id plus 180 count columns. Here are some examples of what those 180 count subqueries would be named to give some further explanation:

cnt_samesamesame_day0/day3/day7/...
cnt_samesamediff_day0/day3/day7/...
cnt_samediffdiff_day0/day3/day7/...

The query is capable of finishing, but as you can imagine it takes a very long time. Just calculating the cnt_samesamesame_day0 takes over a minute.

It's tough to include a sample of what the output would look like because it's extremely sparse.

Any suggestions on how to do this more efficiently? Specific examples are much appreciated, but even a better general approach would be amazing. Thank you!

(I'm attempting to implement this on an Amazon Redshift cluster)

Count many distinct combinations of columns with multiple conditions

Answers (1)

Related Questions