country_dev
country_dev

Reputation: 605

SQL join to see if a row exists in multiple tables

So I have 3 tables: Users, Purchases A, and Purchases B. I want to find all users that have made any purchase within the first 30 days of joining. So all users with a purchase from either Purchases A or Purchases B within the first 30 days after created_at.

A few things to note:

What is the best way to approach this with tables this large? I know I can accomplish this with 2 left joins, but I don't need all of the rows from the Purchases tables, I just need to know if any exist.

SELECT
  user_id
  ,name
  ,created_at
FROM
  users u
LEFT JOIN purchases_a pa
ON u.user_id = pa.user_id AND TO_DATE(pa.purchase_date) <= DATEADD(DAY, 30, u.created_at)
LEFT JOIN purchases_b pb
ON u.user_id = pb.user_id AND TO_DATE(pb.purchase_date) <= DATEADD(DAY, 30, u.created_at)

Is a join even the right approach here or would IN or EXISTS be more appropriate?

enter image description here

Upvotes: 0

Views: 1161

Answers (1)

Bill Karwin
Bill Karwin

Reputation: 562871

Queries against data sets this large is bound to take a long time, no matter how you try to optimize it. You are doing a table-scan of 200 million users! That alone would be costly, to say nothing of the join.

The bottom line is that this is not a query to run against a normalized database of that scale, using a database technology optimized for OLTP work.

Maybe it could be done on a data warehouse, but not MySQL. Even on a data warehouse, you should transform the data to support that specific report.

One approach you could do (on MySQL or on a DW) would be to denormalize slightly, by adding an attribute to the users table that stores a boolean which is true for each user if they have made a purchase within 30 days of their creation date. Then you can index that boolean column. Your query can give you the result without examining 200 million rows.

When a user makes a purchase, your code would update the boolean column to true if it's within 30 days of the creation date.

You should also be thinking about sharding, so not all users and their purchases are stored in a single database. MySQL can handle billions of rows, but it's hard to optimize queries at that scale. Splitting the data over multiple MySQL servers will become necessary soon anyway. If you do that, then each server will only need to report on its own subset of the data, so you can run your report in parallel threads, and then combine the results.

Upvotes: 2

Related Questions