Skyline29
Skyline29

Reputation: 45

MYSQL: Table update on consecute dates

Good evening all,

I've been wrestling with this one all day.

I'm attempting to update a column in my customer table that will show the number of times they have been a subscriber. This table is created from a big data dump that has individual rows for every customer, and every month they were a subscriber (allocated an iterating number for each month). The names are unique. It looks something like this and shows that (for example) Jane Doe was a subscriber in period 1000, but not in period 1002.

Row_ID Customer_Name  Date_Code 
1      Jane Doe       1000      
2      Jane Doe       1001      
3      Jane Doe       1004      
4      Jane Doe       1005      
5      Ted Jones      1000      
6      Ted Jones      1001      
7      Ted Jones      1002      
etc...

In this case Jane Doe was a subscriber from 1000-1001, left our subscription, and then came back from 1004-1005. I have a master table that includes all the date logic (start date, end date, date code etc...). It looks roughly like this:

Start_Date   End_Date    Date_Code
1990-01-01   1990-03-31  1000
1990-04-01   1990-06-30  1001
1990-07-01   1990-09-30  1002
1990-10-01   1990-12-31  1003
etc...

I'm trying to find a way to make the output something like:

Customer_Name  Subscription_Count
Jane Doe       2
Ted Jones      1

Has anyone run into something like this before? It's obvious to me (as a human) that the numbers are (or not) consecutive and are (or not) a representation of the entire sample, but I'm not sure how to make MYSQL understand it. I appreciate any ideas.

*EDIT - I tried both the Join and Where Not Exists alternatives, and both timed out after 10 minutes. I believe it's due to the size of the main table (~100,000 lines). Do you have any suggestions? Thanks again for the all the comments.

**EDIT #2 - After adding indices and tweaking my tables a bit, both solutions work great. Thanks again for the support in figuring this out.

Upvotes: 1

Views: 130

Answers (2)

user1191247
user1191247

Reputation: 12973

I cannot be 100% sure that this is still the case but LEFT JOIN / IS NULL is generally faster than NOT EXISTS in MySQL -

SELECT t1.customer_name, COUNT(*) AS subscriptions
FROM   tbl t1
LEFT JOIN tbl t2
    ON t1.customer_name = t2.customer_name
    AND t1.date_code + 1 = t2.date_code
WHERE t2.customer_name IS NULL
GROUP BY t1.customer_name

UPDATE The addition of a composite index across these two fields instead of the two single column indices gives a significant performance boost -

CREATE UNIQUE INDEX `UQ_customer_date_code` ON tbl (customer_name, date_code);

I have done some tests using a test table with 1.6 million records (100k customers across 21 date_codes). With this index added the query time reduces by about 80%. Using the LEFT JOIN instead of NOT EXISTS only reduces query time by about 15%.

Upvotes: 1

Erwin Brandstetter
Erwin Brandstetter

Reputation: 657052

Query could look something like this:

SELECT customer_name, count(*) AS subscriptions
FROM   tbl AS t
WHERE NOT EXISTS (
    SELECT *
    FROM tbl AS t1
    WHERE t1.customer_name = t.customer_name
    AND t1.date_code = t.date_code + 1
    )
GROUP BY customer_name;

The trick here is to exclude all rows but one per series of date_codes of a customer and then count: only the last row per block does not have a successor (date_code + 1).

I am assuming that successive date_codes form one subscription (as per my first comment to the question). Consequently the additional information on Start_Date and End_Date is not needed.


Performance

LEFT JOIN / IS NULL should in fact be a bit faster than NOT EXISTS in MySQL (as @nnichols supplied).
The much more important thing for performance are indexes. For this to be fast you need indexes on customer_name and on date_code. Like this:

CREATE INDEX tbl_customer_name ON tbl(customer_name);
CREATE INDEX tbl_date_code ON tbl(date_code);

Upvotes: 1

Related Questions