Noob Coder
Noob Coder

Reputation: 2896

mysql multiple left join and group by main table

I have the following scenario. An area has multiple territories, a territory has multiple addresses and an address is visited multiple times in a month. Now I want to generate a monthly report about an area. (How many times an area has been visited).I have written the query but the result set is producing less areas because some addresses are not visited. I have the following structure

tables

areas: id|name (180 rows) //name is unique

territories: id|name|area_id (1k rows)

addresses: id|name|territory_id (80k rows)

visiting_addresses: id|address_id|date|status (1M+ rows) //status => 1 = visited, 2 = pending

My query is following.

select ar.id as area_id, ar.name as area,
sum(case when va.status = 1 then 1 else 0 end) as visited,
sum(case when va.status = 2 then 1 else 0 end) as pending,
count(va.id) as total

from      areas ar
left join territories t on t.area_id=ar.id
left join addresses a on a.territory_id=t.id
left join visiting_addresses va on va.address_id=a.id
where     month(va.date) = '01'
and       year(va.date)='2020'
group by  ar.id

the area table contains 180 areas but the result set shows only 144 areas. Where is my mistake and what is the explanation to this? those areas are missing because they have no visiting.

Upvotes: 0

Views: 435

Answers (2)

Paul Spiegel
Paul Spiegel

Reputation: 31772

Your WHERE clause is converting the LEFT JOIN with visiting_addresses to an INNER JOIN. And since it's the right most table in a LEFT-JOIN-chain, all joins will be converted to INNER JOINS. To prevent that, you should move the corresponding conditions from the WHERE clause to the ON clause:

select ar.id as area_id, ar.name as area,
sum(case when va.status = 1 then 1 else 0 end) as visited,
sum(case when va.status = 2 then 1 else 0 end) as pending,
count(va.id) as total

from      areas ar
left join territories t on t.area_id=ar.id
left join addresses a on a.territory_id=t.id
left join visiting_addresses va
  on  va.address_id=a.id
  and month(va.date) = '01'
  and year(va.date)='2020'

group by  ar.id

But since you have alot of rows, I would rather run two queries. First get only areas with adresses form the last month using inner joins. You should though change your conditions on va.date to utilize an index:

select ar.id as area_id, ar.name as area,
sum(case when va.status = 1 then 1 else 0 end) as visited,
sum(case when va.status = 2 then 1 else 0 end) as pending,
count(va.id) as total

from areas ar
join territories t on t.area_id=ar.id
join addresses a on a.territory_id=t.id
join visiting_addresses va on  va.address_id=a.id
where va.date >= '2020-01-01'
  and va.date <  '2020-02-01'

group by  ar.id

Make sure you have an index on visiting_addresses(date) or even better on visiting_addresses(date, address_id, status).

Then get all areas with a simple

select ar.id as area_id, ar.name as area 
from areas ar

and add missing areas to the first result while setting visited, pending and total to zero (in application code).

The INNER JOIN should be much faster, because now the engine can start reading only the necessary rows from the visiting_addresses using an index for the WHERE conditions.

You could also use a more complex but single query. The Idea is to use a LEFT JOIN with a pre-aggregated subquery:

select ar.id as area_id, ar.name as area,
    coalesce(visited, 0) as visited,
    coalesce(pending, 0) as pending,
    coalesce(total, 0) as total
from areas ar
left join (
    select t.area_id
    sum(case when va.status = 1 then 1 else 0 end) as visited,
    sum(case when va.status = 2 then 1 else 0 end) as pending,
    count(va.id) as total
    from territories t
    join addresses a on a.territory_id=t.id
    join visiting_addresses va on  va.address_id=a.id
    where va.date >= '2020-01-01'
      and va.date <  '2020-02-01'
    group by t.area_id
) x on x.area_id = ar.id

Upvotes: 1

Tim Biegeleisen
Tim Biegeleisen

Reputation: 521093

Try moving the logic in the WHERE clause to the ON clause of the appropriate join:

SELECT
    ar.id AS area_id,
    ar.name AS area,
    COUNT(CASE WHEN va.status = 1 THEN 1 END) AS visited,
    COUNT(CASE WHEN va.status = 2 THEN 1 END) AS pending,
    COUNT(va.id) AS total
FROM areas ar
LEFT JOIN territories t ON t.area_id = ar.id
LEFT JOIN addresses a ON a.territory_id = t.id
LEFT JOIN visiting_addresses va ON va.address_id = a.id AND
    va.date >= '2020-01-01' AND va.date <  '2020-02-01'
GROUP BY
    ar.id;

Note that selecting the name field while only aggregating by id is valid in MySQL, assuming that id be a unique field in the areas table.

You may also try adding the following index to the visiting_addresses table:

CREATE INDEX date_idx ON visiting_addresses (address_id, date, status);

This might help speed up the join to this table.

Upvotes: 0

Related Questions