Tim
Tim

Reputation: 101

Count many distinct combinations of columns with multiple conditions

I have a table with a running history of customer interactions named fact_interactions. Every time a customer is contacted, a new record is created with specific details about the interaction. Here is an example:

inter_id |customer_id |business_id |department_id |datetime_local      |outcome_id |
---------|------------|------------|--------------|--------------------|-----------|
46032383 |1           |112         |1916          |2015-01-14 19:54:20 |48         |
55740863 |2           |2           |3358          |2015-05-06 12:02:12 |19         |
49512895 |3           |160         |396           |2015-01-22 11:57:17 |19         |
51822751 |3           |160         |396           |2015-01-28 13:46:19 |19         |
23533190 |4           |132         |425           |2015-03-26 12:42:24 |19         |
69354240 |5           |164         |3061          |2015-03-30 11:01:43 |19         |
61417848 |5           |164         |3061          |2015-04-01 14:36:30 |19         |
74948424 |5           |164         |3061          |2015-04-28 15:12:42 |19         |
75303296 |5           |164         |3061          |2015-04-29 13:51:02 |10         |
76071776 |5           |164         |3061          |2015-05-01 09:18:39 |10         |

For each record, I need to find all the rows that match by multiple conditions, across multiple time windows. Here's a sample of my query with a few of the different subqueries I'm currently using:

SELECT
    inter_id,
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
      AND   B.business_id = A.business_id
      AND   B.department_id = A.department_id
      AND   B.datetime_local::date = A.datetime_local::date
      AND   B.datetime_local < A.datetime_local) AS cnt_samesamesame_day0
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
      AND   B.business_id = A.business_id
      AND   B.department_id <> A.department_id
      AND   B.datetime_local::date = A.datetime_local::date
      AND   B.datetime_local < A.datetime_local) AS cnt_samesamediff_day0
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
      AND   B.business_id <> A.business_id
      AND   B.department_id <> A.department_id
      AND   B.datetime_local::date = A.datetime_local::date
      AND   B.datetime_local < A.datetime_local) AS cnt_samediffdiff_day0
FROM fact_interactions A;

In total I have 180 subqueries for the counts I'm trying to calculate. So if fact_interaction has 1,000,000 records the output would also have 1,000,000 records, but would have the inter_id plus 180 count columns. Here are some examples of what those 180 count subqueries would be named to give some further explanation:

The query is capable of finishing, but as you can imagine it takes a very long time. Just calculating the cnt_samesamesame_day0 takes over a minute.

It's tough to include a sample of what the output would look like because it's extremely sparse.

Any suggestions on how to do this more efficiently? Specific examples are much appreciated, but even a better general approach would be amazing. Thank you!

(I'm attempting to implement this on an Amazon Redshift cluster)

Upvotes: 1

Views: 442

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1270371

I might suggest that you learn about window functions. For instance:

SELECT inter_id,
       COUNT(*) OVER (PARTITION BY customer_id, business_id, department_id, department_id, datetime_local
                      ORDER BY datetime_local
                     ) as cnt_samesamesame_day0,
       . . . 

There are probably similar constructs for the other columns.

Upvotes: 1

Related Questions