Reputation: 45
Good evening all,
I've been wrestling with this one all day.
I'm attempting to update a column in my customer table that will show the number of times they have been a subscriber. This table is created from a big data dump that has individual rows for every customer, and every month they were a subscriber (allocated an iterating number for each month). The names are unique. It looks something like this and shows that (for example) Jane Doe was a subscriber in period 1000, but not in period 1002.
Row_ID Customer_Name Date_Code
1 Jane Doe 1000
2 Jane Doe 1001
3 Jane Doe 1004
4 Jane Doe 1005
5 Ted Jones 1000
6 Ted Jones 1001
7 Ted Jones 1002
etc...
In this case Jane Doe was a subscriber from 1000-1001, left our subscription, and then came back from 1004-1005. I have a master table that includes all the date logic (start date, end date, date code etc...). It looks roughly like this:
Start_Date End_Date Date_Code
1990-01-01 1990-03-31 1000
1990-04-01 1990-06-30 1001
1990-07-01 1990-09-30 1002
1990-10-01 1990-12-31 1003
etc...
I'm trying to find a way to make the output something like:
Customer_Name Subscription_Count
Jane Doe 2
Ted Jones 1
Has anyone run into something like this before? It's obvious to me (as a human) that the numbers are (or not) consecutive and are (or not) a representation of the entire sample, but I'm not sure how to make MYSQL understand it. I appreciate any ideas.
*EDIT - I tried both the Join and Where Not Exists alternatives, and both timed out after 10 minutes. I believe it's due to the size of the main table (~100,000 lines). Do you have any suggestions? Thanks again for the all the comments.
**EDIT #2 - After adding indices and tweaking my tables a bit, both solutions work great. Thanks again for the support in figuring this out.
Upvotes: 1
Views: 130
Reputation: 12973
I cannot be 100% sure that this is still the case but LEFT JOIN / IS NULL is generally faster than NOT EXISTS in MySQL -
SELECT t1.customer_name, COUNT(*) AS subscriptions
FROM tbl t1
LEFT JOIN tbl t2
ON t1.customer_name = t2.customer_name
AND t1.date_code + 1 = t2.date_code
WHERE t2.customer_name IS NULL
GROUP BY t1.customer_name
UPDATE The addition of a composite index across these two fields instead of the two single column indices gives a significant performance boost -
CREATE UNIQUE INDEX `UQ_customer_date_code` ON tbl (customer_name, date_code);
I have done some tests using a test table with 1.6 million records (100k customers across 21 date_codes). With this index added the query time reduces by about 80%. Using the LEFT JOIN instead of NOT EXISTS only reduces query time by about 15%.
Upvotes: 1
Reputation: 657052
Query could look something like this:
SELECT customer_name, count(*) AS subscriptions
FROM tbl AS t
WHERE NOT EXISTS (
SELECT *
FROM tbl AS t1
WHERE t1.customer_name = t.customer_name
AND t1.date_code = t.date_code + 1
)
GROUP BY customer_name;
The trick here is to exclude all rows but one per series of date_codes of a customer and then count: only the last row per block does not have a successor (date_code + 1
).
I am assuming that successive date_codes form one subscription (as per my first comment to the question). Consequently the additional information on Start_Date
and End_Date
is not needed.
LEFT JOIN / IS NULL
should in fact be a bit faster than NOT EXISTS
in MySQL (as @nnichols supplied).
The much more important thing for performance are indexes. For this to be fast you need indexes on customer_name
and on date_code
. Like this:
CREATE INDEX tbl_customer_name ON tbl(customer_name);
CREATE INDEX tbl_date_code ON tbl(date_code);
Upvotes: 1