MySQL - Find Orders placed by repeat customers vs. new customers over time

Question

I have an orders table that contains the orders_id, customers_email_address and date_purchased. I want to write a SQL query that will, for each line in the table, add a new field called 'repeat_order_count' that shows how many times this customer ordered before and including this order.

For example, if John ordered once before this order, the repeat_order_count would be 2 for this order, or in other words, this is the second time John has ordered. The next order row I encounter for John will have a 3, and so on. This will allow me to create a line graph that shows the number of orders placed by repeat customers over time. I can now go to a specific time in the past and figure out how many orders were placed by repeat customers during that time period:

SELECT 
      * 
FROM orders 
WHERE repeat_order_count > 1 
WHERE date_purchased = January 2014 --(simplifying things here)

I'm also able to determine now WHEN a customer became a repeat customer.

I can't figure out the query to solve this. Or perhaps there may be an easier way to do this?

spencer7593 · Accepted Answer

One approach to retrieving the specified result would be to use a correlated subquery in the SELECT list. This assumes that the customer identifier is customers_email_address, and that date_purchased is a DATETIME or TIMESTAMP (or other canonical format), and that there are no duplicated values for the same customer (that is, the customer doesn't have two or more orders with the same date_purchased value.)

SELECT s.orders_id
     , s.customers_email_address
     , s.date_purchased 
     , ( SELECT COUNT(1)
           FROM orders p
          WHERE p.customers_email_address = s.customers_email_address
            AND p.date_purchased          < s.date_purchased
       ) AS previous_order_count
  FROM orders s
 ORDER
    BY s.customers_email_address
     , s.date_purchased

The correlated subquery will return 0 for the "first" order for a customer, and 1 for the "second" order. If you want to include the current order in the count, replace the < comparison operator with <= operator.

FOLLOWUP

For performance of that query, we need to be particulary concerned with the performance of the correlated subquery, since that is going to be executed for every row in the table. (A million rows in the table means a million executions of that query.) Having a suitable index available is going to be crucial.

For the query in my answer, I'd recommend trying an index like this:

ON orders (customers_email_address, date_purchased, orders_id)

With that index in place, we'd expect EXPLAIN to show the index being used by both the outer query, to satisfy the ORDER BY (No "Using filesort" in the Extra column), and as a covering index (no lookups to the pages in the underlying table, "Using index" shown in the Extra column.)

The answer I gave demonstrated just one approach. It's also possible to return an equivalent result using a join pattern, for example:

SELECT s.orders_id
     , s.customers_email_address
     , s.date_purchased
     , COUNT(p.orders_id)
  FROM orders s
  JOIN orders p
    ON p.customers_email_address = s.customers_email_address
   AND p.date_purchased         <= s.date_purchased
 GROUP
    BY s.customers_email_address
     , s.date_purchased
     , s.orders_id
 ORDER
    BY s.customers_email_address
     , s.date_purchased
     , s.orders_id

(This query is based on some additional information provided in a comment, which wasn't available before: orders_id is UNIQUE in the orders table.)

If we are guaranteed that orders_id of a "previous" order is less than the orders_id of a previous order, then it would be possible to use that column in place of the date_purchased column. We'd want a suitable index available:

... ON orders (customers_email_address, orders_id, date_purchased)

NOTE: The order of the columns in the index is important. With that index, we could do:

SELECT s.orders_id
     , s.customers_email_address
     , s.date_purchased
     , COUNT(p.orders_id)
  FROM orders s
  JOIN orders p
    ON p.customers_email_address = s.customers_email_address
   AND p.orders_id              <= s.orders_id
 GROUP
    BY s.customers_email_address
     , s.orders_id
 ORDER
    BY s.customers_email_address
     , s.orders_id

Again, we'd want to review the output from EXPLAIN to verify that the index is being used for both the join operation and the GROUP BY operation.

NOTE: With the inner join, we need to use a <= comparison, so we get at least one matching row back. We could either subtract 1 from that result, if we wanted a count of only "previous" orders (not counting the current order), or we could use an outer join operation with a < comparison, so we could get a row back with a count of 0.

MySQL - Find Orders placed by repeat customers vs. new customers over time

Answers (2)

Related Questions