When joining multiple tables, the sum is a multiple of what it should be

Question

I have 4 tables:

employee: employee_id (and several unrelated columns)
calls: employee_id, date, number_of_calls (on that given date by employee)
employee_skills: employee_id, skills_id
skills: skills_id, primary_skill, secondary_skill

As a site note: In employee_skills combinations of primary_skill and secondary_skill can occur several times with a different skill_id. There can be two entries such as 234 | Python | Java and 235 | Python | Java.

I would like to have a SELECTS-statement that returns: employee_id, number_of_total_calls (of that employee), number_of_unique_secondary_skills(of that employee)

The problem I am facing is regarding the number_of_total_calls_of_employee. When I only join the tables employee and calls, I get the correct number (sum) of total calls of that employee. However, when I also join the tables employee_skills and skills, I get a multiple of the correct number (e.g. 6 times the correct number for employee_id 123). I think that this is the case since when doing so the employee_id occurs several times in the new table, not just once (the employee_id 123 occurs 6 times in the new table), since the employee_id is several times in the table employee_skills (e.g. the employee_id 123 occurs 6 times in employee_skills).

The SELECT-statement I am using looks as follows:

SELECT e.employee_id, SUM(c.number_of_calls), COUNT(DISTINCT s.secondary_skill)
FROM employees e
INNER JOIN calls c
ON e.employee_id = c.employee_id
INNER JOIN  employee_skills es
ON e.employee_id = es.employee_id
INNER JOIN skills s
ON es.skills_id = s.skills_id
GROUP BY e.employee_id;

Gordon Linoff · Accepted Answer

First, you don't need the employees table, because employee_id is in both other tables. I would then recommend aggregating along both dimensions before joining:

SELECT es.employee_id, c.number_of_calls, es.num_secondary_skills
FROM (SELECT c.employee_id, SUM(c.number_of_calls) as number_of_calls
      FROM calls c
      GROUP BY c.employe_id
     ) c JOIN
     (SELECT es.employee_id, COUNT(DISTINCT s.secondary_skill) as num_secondary_skills
      FROM employee_skills es JOIN
           skills s
           ON es.skills_id = s.skills_id
      GROUP BY es.employee_id
     ) es
     ON es.employee_id = c.employee_id;

The above only returns employees that both have calls and skills. If some employees are missing one or the other, then they are filtered out.

If you want all employees, then the employees table is handy -- with outer joins:

SELECT e.employee_id, c.number_of_calls, es.num_secondary_skills
FROM employees e LEFT JOIN
     (SELECT c.employee_id, SUM(c.number_of_calls) as number_of_calls
      FROM calls c
      GROUP BY c.employe_id
     ) c
     ON c.employee_id = e.employee_id LEFT JOIN
     (SELECT es.employee_id, COUNT(DISTINCT s.secondary_skill) as num_secondary_skills
      FROM employee_skills es JOIN
           skills s
           ON es.skills_id = s.skills_id
      GROUP BY es.employee_id
     ) es
     ON es.employee_id = e.employee_id;

When joining multiple tables, the sum is a multiple of what it should be

Answers (2)

Related Questions