Reputation: 42099

Making simple SQL more efficient

I'm having a slow start to the morning. I thought there was a more efficient way to make the following query using a join, instead of two independent selects -- am I wrong?

Keep in mind that I've simplified/reduced my query into this example for SO purposes, so let me know if you have any questions as well.

SELECT DISTINCT c.* 
FROM   customers c
WHERE  c.customer_id IN (select customer_id from customers_cars where car_make = 'BMW')
  AND  c.customer_id IN (select customer_id from customers_cars where car_make = 'Ford')
;

Sample Table Schemas

-- Simple tables to demonstrate point
CREATE TABLE customers (
  customer_id serial,
  name text
  );

CREATE TABLE customers_cars (
  customer_id integer,
  car_make text
  );


-- Populate tables
INSERT INTO customers(name) VALUES
  ('Joe Dirt'),
  ('Penny Price'),
  ('Wooten Nagen'),
  ('Captain Planet')
;

INSERT INTO customers_cars(customer_id,car_make) VALUES
  (1,'BMW'),
  (1,'Merc'),
  (1,'Ford'),
  (2,'BMW'),
  (2,'BMW'),      -- Notice car_make is not unique
  (2,'Ferrari'),
  (2,'Porche'),
  (3,'BMW'),
  (3,'Ford');
-- ids 1 and 3 both have BMW and Ford

Other Expectations

There are ~20 car_make in the database
There are typically 1-3 car_make per customer_id
There is expected to be not more than 50 car_make assignments per customer_id (generally 20-30)
The query is generally only going to look for 2-3 specific car_make per customer (e.g., BMW and Ford), but not 10-20

Upvotes: 2

Answers (4)

gmarintes

Reputation: 1308

It seems to me that you are trying to find customers that has at least 1 BMW and at least 1 Ford car. This query should get that for you:

SELECT
     customers.customer_id
FROM
    customers
        INNER JOIN customer_cars ON
            customers.customer_id = customer_cars.customers_id
            AND customer_cars.car_make IN ('BMW', 'Ford')
GROUP BY
    customers.customer_id
HAVING
    COUNT(CASE WHEN car_make = 'BMW' THEN 1 ELSE NULL END) > 0
    AND COUNT(CASE WHEN car_make = 'Ford' THEN 1 ELSE NULL END) > 0

Make sure you have an indexes on customer_cars.customer_id and customer_cars.car_make to achieve maximum performance.

Upvotes: 1

MK.

Reputation: 34537

I would write it as

SELECT DISTINCT c.customer_id 
FROM   customers c
JOIN   customers_cars cc_f on c.customer_id = cc_f.customer_id and cc_f.car_make = 'Ford'
JOIN   customers_cars cc_b on c.customer_id = cc_b.customer_id and cc_b.car_make = 'BMW'
;

Whether this is better or not I don't know. In some RDBMs plain joins like this work better than subqueries, but I don't know about Postgres. From readability point of view it is also questionable.

Upvotes: 1

Erwin Brandstetter

Reputation: 656431

You don't need to join to customers at all (given relational integrity).

Generally, this is a case of relational division. We assembled an arsenal of techniques under this related question:

How to filter SQL results in a has-many-through relation

Unique combinations

If (customer_id, car_make) was defined unique in customers_cars, it would get much simpler:

SELECT customer_id
FROM   customers_cars
WHERE  car_make IN ('BMW', 'Ford')
GROUP  BY 1
HAVING count(*) = 2;

Combinations not unique

Since (customer_id, car_make) is not unique, we need an extra step.

For only a few cars, your original query is not that bad. But (especially with duplicates!) EXISTS is typically faster than IN, and we don't need the final DISTINCT:

SELECT customer_id -- no DISTINCT needed.
FROM   customers c
WHERE  EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'BMW')
AND    EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'Ford');

Above query gets verbose and less efficient for a longer list of cars. For an arbitrary number of cars I suggest:

SELECT customer_id
FROM  (
   SELECT customer_id, car_make
   FROM   customers_cars
   WHERE  car_make IN ('BMW', 'Ford')
   GROUP  BY 1, 2
   ) sub
GROUP  BY 1
HAVING count(*) = 2;

SQL Fiddle.

Upvotes: 0

Frank Heikens

Reputation: 126991

And here another option, don't know what the fastest one would be on large tables.

SELECT  customers.*
FROM    customers
    JOIN customers_cars USING(customer_id)
WHERE   car_make = ANY(ARRAY['BMW','Ford'])
GROUP BY
    customer_id, name
HAVING  array_agg(car_make) @> ARRAY['BMW','Ford'];

vol7ron: Fiddle

The following is a modification of the above, taking the same idea using an array for comparison. I'm not sure how any more efficient it would be compared to the dual-query approach, since it would have to create an array as one pass and then do more heavy-handed comparison because of comparing the elements of an array.

SELECT DISTINCT c.* 
FROM   customers c
WHERE  customer_id IN (
  select   customer_id
  from     customers_cars 
  group by customer_id
  having   array_agg(car_make) @> ARRAY['BMW','Ford']
);