whsv26
whsv26

Reputation: 63

Join after Group by performance

Join tables and then group by multiple columns (like title) or group rows in sub-query and then join other tables? Is the second method slow because of lack of indexes after grouping? Should I order rows manually for second method to trigger merge join instead of nested loop? How to do it properly?

This is the first method. Became quite a mess cause of contragent_title and product_title are required to be in group by for strict mode. And I work with strict group by mode only.

SELECT
    s.contragent_id,
    s.contragent_title,
    s.product_id AS sort_id,
    s.product_title AS sort_title,
    COALESCE(SUM(s.amount), 0) AS amount,
    COALESCE(SUM(s.price), 0) AS price,
    COALESCE(SUM(s.discount), 0) AS discount,
    COUNT(DISTINCT s.product_id) AS sorts_count,
    COUNT(DISTINCT s.contragent_id) AS contragents_count,
    dd.date,
    ~grouping(dd.date, s.contragent_id, s.product_id) :: bit(3) AS mask
FROM date_dimension dd
LEFT JOIN (
    SELECT 
        s.id, 
        s.created_at,
        s.contragent_id, 
        ca.title AS contragent_title,
        p.id AS product_id, 
        p.title AS product_title,
        sp.amount, 
        sp.price, 
        sp.discount
    FROM sales s
    LEFT JOIN sold_products sp 
        ON s.id = sp.sale_id
    LEFT JOIN products p 
        ON sp.product_id = p.id
    LEFT JOIN contragents ca 
        ON s.contragent_id = ca.id
    WHERE s.created_at BETWEEN :caf AND :cat
        AND s.plant_id = :plant_id
        AND (s.is_cache = :is_cache OR :is_cache IS NULL)
        AND (sp.product_id = :sort_id OR :sort_id IS NULL)
) s ON dd.date = date(s.created_at)                
WHERE (dd.date BETWEEN :caf AND :cat)
GROUP BY GROUPING SETS (
    (dd.date, s.contragent_id, s.contragent_title, s.product_id, s.product_title),
    (dd.date, s.contragent_id, s.contragent_title),
    (dd.date)
)

Upvotes: 1

Views: 2407

Answers (1)

Thorsten Kettner
Thorsten Kettner

Reputation: 94894

This is an example of what you are talking about:

Join, then aggregate:

select d.name, count(e.employee_id) as number_of_johns
from departments d
left join employees e on e.department_id = e.department_id
where e.first_name = 'John'
group by d.department_id;

Aggregate then join:

select d.name, coalesce(number_of_johns, 0) as number_of_johns
from departments d
left join
(
  select department_id, count(*) as number_of_johns
  from employees
  where first_name = 'John'
  group by department_id
) e on e.department_id = e.department_id;

Question

You want to know whether one is faster than the other, assuming the latter may be slower for loosing the direct table links via IDs. (While every query result is a table, and hence the subquery result also is, it is no physical table stored in the database and has hence no indexes.)

Thinking and guessing

Let's see what the queries do:

  1. The first query is supposed to join all departments and employees and only keep the Johns. How will it do that? It will probably find all Johns first. If there is an index on employees(first_name), it will probably use that, otherwise it will read the full table. Then find the counts by department_id. If the index I talked about even contained the department (index on employees(first_name, department_id), the DBMS would now have the Johns presorted and could just count. If it doesn't the DBMS may order the employee rows now and count then or use some other method for counting. And if we were looking for two names instead of just one, the compound index would be of little or no benefit compared to the mere index on first_name. At last the DBMS will read all departments and join the found counts. But our count result rows are not a table, so there is no index we can use. Anyway, the DBMS will just either just loop over the results or have them sorted anyway, so the join is easy peasy. So far from what I think the DBMS will do. There are a lot of ifs in my assumptions and the DBMS may still have other methods to choose from or won't use an index at all because the tables are so small anyway, or whatever.
  2. The second query, well, same same.

Answer

You see, we can only guess how a DBMS will approach joins with aggregations. It may or may not come up with the same execution plan for the two queries. A perfect DBMS would create the same plan, as the two queries do the same thing. A not so perfect DBMS may create different plans, but which is better we can hardly guess. Let's just rely on the DBMS to do a good job concerning this.

I am using Oracle mainly and just tried about the same thing as shown with two of my tables. It shows exactly the same execution plan for both queries. PostgreSQL is also a great DBMS. Nothing to worry about, I'd say :-)

Better focus on writing readable, maintainable queries. With these small queries there is no big difference; the first one is a tad mor compact and easy to grab, the second a tad more sophisticated.

I, personally, prefer the second query. It is good style to aggregate before joining and such queries can be easily extended with further aggregations, which can be much more difficult with the first one. Only if I ran into performance issues, I would try a different approach.

Upvotes: 2

Related Questions