Reputation: 101
I have a table named fact_interactions with a running history of customer interactions. Every time a customer is contacted, a new record is created with specific details about the interaction. Here is an example:
inter_id |customer_id |business_id |department_id |datetime_local |outcome_id |
---------|------------|------------|--------------|--------------------|-----------|
46032383 |1 |112 |1916 |2015-01-14 19:54:20 |48 |
55740863 |2 |2 |3358 |2015-05-06 12:02:12 |19 |
49512895 |3 |160 |396 |2015-01-22 11:57:17 |19 |
51822751 |3 |160 |396 |2015-01-28 13:46:19 |19 |
23533190 |4 |132 |425 |2015-03-26 12:42:24 |19 |
69354240 |5 |164 |3061 |2015-03-30 11:01:43 |19 |
61417848 |5 |164 |3061 |2015-04-01 14:36:30 |19 |
74948424 |5 |164 |3061 |2015-04-28 15:12:42 |19 |
75303296 |5 |164 |3061 |2015-04-29 13:51:02 |10 |
76071776 |5 |164 |3061 |2015-05-01 09:18:39 |10 |
For each record, I need to count all the other rows that match on multiple conditions, across multiple time windows. Here's a sample of my query with a few of the different subqueries I'm currently using:
SELECT
    inter_id,
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
        AND B.business_id = A.business_id
        AND B.department_id = A.department_id
        AND B.datetime_local::date = A.datetime_local::date
        AND B.datetime_local < A.datetime_local) AS cnt_samesamesame_day0,
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
        AND B.business_id = A.business_id
        AND B.department_id <> A.department_id
        AND B.datetime_local::date = A.datetime_local::date
        AND B.datetime_local < A.datetime_local) AS cnt_samesamediff_day0,
    (SELECT COUNT(*) FROM fact_interactions B
      WHERE B.customer_id = A.customer_id
        AND B.business_id <> A.business_id
        AND B.department_id <> A.department_id
        AND B.datetime_local::date = A.datetime_local::date
        AND B.datetime_local < A.datetime_local) AS cnt_samediffdiff_day0
FROM fact_interactions A;
In total I have 180 such count subqueries. So if fact_interactions has 1,000,000 records, the output would also have 1,000,000 records: the inter_id plus 180 count columns. The aliases in the query above (cnt_samesamesame_day0, cnt_samesamediff_day0, cnt_samediffdiff_day0) give some idea of how those 180 count columns are named.
The query is capable of finishing, but as you can imagine it takes a very long time. Calculating cnt_samesamesame_day0 alone takes over a minute.
It's tough to include a sample of what the output would look like because it's extremely sparse.
Any suggestions on how to do this more efficiently? Specific examples are much appreciated, but even a better general approach would be amazing. Thank you!
(I'm attempting to implement this on an Amazon Redshift cluster)
Upvotes: 1
Views: 442
Reputation: 1270371
I might suggest that you learn about window functions. For instance:
SELECT inter_id,
       COUNT(*) OVER (PARTITION BY customer_id, business_id, department_id, datetime_local::date
                      ORDER BY datetime_local
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                     ) AS cnt_samesamesame_day0,
. . .
There are probably similar constructs for the other columns.
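For example, the "diff" combinations could likely be computed by subtraction: count the earlier same-day rows over a broader partition and subtract the counts over the narrower partitions. Here is a rough sketch, using the column names from your question; it assumes department_id is unique within a business (so a different business implies a different department), and it is not tested on Redshift:
SELECT inter_id,
       -- same customer, business, department; earlier the same day
       COUNT(*) OVER (PARTITION BY customer_id, business_id, department_id, datetime_local::date
                      ORDER BY datetime_local
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS cnt_samesamesame_day0,
       -- same customer and business, different department:
       -- all earlier same-day rows for the customer/business minus the same-department ones
       COUNT(*) OVER (PARTITION BY customer_id, business_id, datetime_local::date
                      ORDER BY datetime_local
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
     - COUNT(*) OVER (PARTITION BY customer_id, business_id, department_id, datetime_local::date
                      ORDER BY datetime_local
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS cnt_samesamediff_day0,
       -- same customer, different business (and, under the assumption above, different department):
       -- all earlier same-day rows for the customer minus the same-business ones
       COUNT(*) OVER (PARTITION BY customer_id, datetime_local::date
                      ORDER BY datetime_local
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
     - COUNT(*) OVER (PARTITION BY customer_id, business_id, datetime_local::date
                      ORDER BY datetime_local
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS cnt_samediffdiff_day0
FROM fact_interactions;
Note that Redshift requires an explicit frame clause when an aggregate window function has an ORDER BY, and the ROWS ... 1 PRECEDING frame counts strictly earlier rows, mirroring your B.datetime_local < A.datetime_local condition (ties on the exact timestamp may be counted differently). The attraction of this approach is that all the counts come out of a single pass over the table instead of 180 correlated subqueries.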
Upvotes: 1