Reputation: 81
This is a SQL Server question.
I'm linking three tables from the same database and keep getting duplicate values. Tables REF_Plan_Dictionary and REF_Plan_Assign_Default will produce duplicates on their own because some of the columns will reference multiple sub-options in other columns. Table dev_OUT_MasterEmp will only produce one record per EmpID.
The purpose of the query is to use the M.EmpID and match the information from the other two tables to this piece of information. The duplicates are where the EmpID are repeating several times but all of the information matches only once. Meaning the sub-options I referred to are not showing up since the other two tables are matching correctly. So it's not the matching, it's the fact that the same information is duplicated.
How do I get rid of the duplicates?
SELECT M.EmpId
,EmpName
,[UserId]
,PA.Plan_Dict_Id
,m.Job_Code
,Dept_Num
,PayGroup
,D.Plan_Name
,Plan_Desc
,M.[File_Nbr]
,[Mgr_EmpId]
,[Work_Location]
,[Emp_Tenure]
,[Emp_Status]
,[Plan_Eligibility]
,M.[Function_Role_Code]
,[Hire_Date]
,[Job_Entry_Date]
,[Term_Date]
,[Dept_Num]
,[TeamID]
,[CCGroup]
,[Channel]
,[Organization]
,[Hourly_Rate]
,[HC]
FROM [Compensation].[dbo].[dev_OUT_MasterEmp] M
left join Compensation..REF_Plan_Dictionary D on M.Plan_Id = D.Plan_Dict_Id
left join Compensation..REF_Plan_Assign_Default PA on M.Plan_Id = PA.Plan_Dict_Id
This is an example of my results:
This is table REF_Plan Dictionary:
This is table REF_Plan_Assign_Default:
Upvotes: 0
Views: 275
Reputation: 6713
The best way to avoid unwanted duplicate records is to plan out from the start what you consider a duplicate. In other words, do you want one record per employee id or do you want one record per (employee, user account)... etc. Once you've determined the unique key to the output you desire, then you structure your query around it by ensuring that when joining on that key that both sides of the join have only a 1 to 1 match.
So for example, assume that you want only one record per employee id. You can start with your employee table since that is the key of this table. Then every table that you join to it, you have to make sure only has one record per employee id in it as well. You can ensure this with sub-queries that contain a group by. So something like:
Select ...
from employees
join
(Select employeeid, sum(field1), max(field2)...
from employeedetail
group by employeeid
) employeedetails
It looks like in your case you have two types of "detail" tables and your employee record has a foreign key to them. So if you want one record per employee, then make sure that you only get one record returned for both of your detail records foreign key value.
select m.*
FROM [Compensation].[dbo].[dev_OUT_MasterEmp] M
left join
(
select Plan_Dict_Id, aggregate_function(field1) as field1... etc.
Compensation..REF_Plan_Dictionary
group by Plan_Dict_Id
) D
on M.Plan_Id = D.Plan_Dict_Id
left join
(
select Plan_Dict_Id, aggregate_function(field1) as field1... etc.
Compensation..REF_Plan_Assign_Default
group by Plan_Dict_Id
) PA
on M.Plan_Id = PA.Plan_Dict_Id
In the sub-queries, you have to use aggregate functions on the fields (Field1
above) because there might be duplicates. So for example, maybe there is a Payment field and you want to know the total payments that were made to this plan_id
, then you would use sum(Payment)
.
Another way to eliminate all but one "detail" record is to filter. So maybe you only want to see the most recent record. In that case you would use a filter to eliminate all duplicates but the one you are interested in. Your sub-query might look something like this:
select *
from DetailTable
join (
select ID, MAX(updated_date) updated_date
from DetailTable
group by ID
) MaxRecord
on DetailTable.ID = MaxRecord.ID
and DetailTable.updated_date = MaxRecord.updated_date
This sub-query would return only one record per ID
and it would be the record that had the largest updated_date
for that ID.
I can't tell you how you should structure your sub-queries because I can't tell exactly what you are looking to do, but the idea in general is to start with a table or a couple of tables with a join that you can guarantee have a unique key. Then you can do joins and left joins after that to other tables or sub-queries that also have this same unique key. You will never have unexpected duplicates again when you master this.
Upvotes: 0
Reputation: 2583
This is the basic plan:
SELECT EMP.*, PL.Plan_NAME , PL.Plan_DESC,
MAP.Job_Code, MAP.PayGroup
FROM [Compensation].[dbo].[dev_OUT_MasterEmp] EMP
JOIN (
SELECT M.EMPID, MAX(PA.Plan_Dict_Id) AS M_PLAN_ID,
MAX(PA.JobCode) AS Job_Code, MAX(PA.PayGroup) AS PayGroup
FROM [Compensation].[dbo].[dev_OUT_MasterEmp] M
LEFT JOIN Compensation..REF_Plan_Assign_Default PA
ON M.PLAN_ID = PA.Plan_Dict_Id
GROUP BY M.EMPID
) MAP
ON MAP.EMPID= EMPS.EMPID
JOIN Compensation..REF_Plan_Dictionary PL
ON MAP.M_PLAN_ID = PL.PLAN_DICT_ID
please modify the output columns to suit your needs.
Upvotes: 1