deseosuho
deseosuho

Reputation: 1038

SQL Server: table join based on record-dependent values

There's a general type of query I'm trying to perform, and I'm not sure how to express it in words so that I can find a discussion of best practices and examples for executing it.

Here's an example use case.

I have a customers table that has info about customers and an orders table. I want to fetch a subset of records from orders based on customer characteristics, limited by the "earliest" and "latest" dates contained as data in the customers table. It's essential to the solution that I limit my query results to within this date range, which varies by customer.

CUSTOMERS

+------------+------------+----------+---------------------+-------------------+
| CustomerID |  Location  | Industry | EarliestActiveOrder | LatestActiveOrder |
+------------+------------+----------+---------------------+-------------------+
|        001 | New York   | Finance  | 2017-11-03          | 2019-07-30        |
|        002 | California | Tech     | 2018-06-18          | 2019-09-22        |
|        003 | New York   | Finance  | 2015-09-30          | 2019-02-26        |
|        004 | California | Finance  | 2019-02-02          | 2019-08-15        |
|        005 | New York   | Finance  | 2017-10-19          | 2018-12-20        |
+------------+------------+----------+---------------------+-------------------+

ORDERS

+----------+------------+------------+---------+
| OrderID | CustomerID | StartDate  | Details |
+----------+------------+------------+---------+
|     5430 |        003 | 2015-06-30 |     ... |
|     5431 |        003 | 2016-03-31 |     ... |
|     5432 |        003 | 2018-09-30 |     ... |
|     5434 |        001 | 2018-11-05 |     ... |
|     5435 |        001 | 2019-10-11 |     ... |

A sample use case expressed in words would be: "Give me all Active Orders from Finance customers in New York".

Desired result is to return the full records from orders table for OrderID's 5431,5432,5434.

What is a generally good approach for structuring this kind of query, given an orders table with ~10^6 records?

Upvotes: 2

Views: 62

Answers (2)

GMB
GMB

Reputation: 222482

You are looking for a join:

select o.*
from orders o
inner join customers c 
    on  c.Customer_id = o.Customer_id
    and o.StartDate between c.EarliestActiveOrder  and c.LatestActiveOrder
    and c.Industry = 'Finance'
    and c.Location = 'New York'

For performance in this query, consider the following indexes:

orders(customer_id,  StartDate)
customers(Customer_id, Industry, Location, EarliestActiveOrder, LatestActiveOrder)

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269953

Assuming that the result set is a small subset of the orders (say less then 1% of orders but the 1% is for illustration), I would phrase the query like this:

select o.*
from customers c join
     orders o
     on o.Customer_id = c.Customer_id and
        o.StartDate between c.EarliestActiveOrder  and c.LatestActiveOrder
where c.Location = 'New York' and c.industry = 'Finance';

The indexing strategy is tricky. For smallish result sets, you probably want to restrict the customers first and then find the matching orders. This approach suggsts indexes on:

  • customers(location, industry, customer_id, EarliestActiveOrder, LatestActiveOrder)
  • orders(customer_id, startdate)

If you had other columns for filtering, you would need separate indexes for them. For instance, for industry-only filtering:

  • customers(industry, customer_id, EarliestActiveOrder, LatestActiveOrder)

This can get cumbersome.

If, on the other hand, your result set is likely to be a significant number of orders, then scanning the orders table might be more efficient. You can try to rely on the optimizer. Or just push it in the right direction by phrasing the query as:

select o.*
from orders o
where exists (select 1
              from customers c
              where o.Customer_id = c.Customer_id and
                    o.StartDate between c.EarliestActiveOrder  and c.LatestActiveOrder and
                    c.Location = 'New York' and c.industry = 'Finance'
             );

In this case, you want an index on customers(customer_id) -- but that is probably already the primary key so you are fine. This has the advantage that you don't need to worry about the exact filtering criteria. The downside is a full table scan on orders (but not additional work for a join, group by, or order by).

Upvotes: 1

Related Questions