user3788671
user3788671

Reputation: 2057

Remove duplicates with less null values

I have a table of employees which contains about 25 columns. Right now there are a lot of duplicates and I would like to try and get rid of some of these duplicates.

First, I want to find the duplicates by looking for multiple records that have the same values in first name, last name, employee number, company number and status.

SELECT
    firstname,lastname,employeenumber, companynumber, statusflag
FROM
    employeemaster
GROUP BY
    firstname,lastname,employeenumber,companynumber, statusflag
HAVING 
    (COUNT(*) > 1)

This gives me duplicates but my goal is to find and keep the best single record and delete the other records. The "best single record" is defined by the record with the least amount of NULL values in all of the other columns. How can I do this?

I am using Microsoft SQL Server 2012 MGMT Studio.

EXAMPLE:

enter image description here

Red: DELETE Green: KEEP

NOTE: There are a lot more columns in the table than what this table shows.

Upvotes: 8

Views: 3712

Answers (3)

user4420255
user4420255

Reputation:

I test with MySQL and use NULL String concat to found the best record. Because LENGTH ( NULL || 'data') is 0. Only if all column not NULL some length exists. Maybe this is not perfekt.

create table EmployeeMaster
  (
    Record int auto_increment,
    FirstName varchar(50),
    LastName varchar(50),
    EmployeeNumber int,
    CompanyNumber int,
    StatusFlag int,
    UserName varchar(50),
    Branch varchar(50),

    PRIMARY KEY(record)
  );
INSERT INTO EmployeeMaster
  (
  FirstName, LastName, EmployeeNumber, CompanyNumber, StatusFlag, UserName, Branch
  ) VALUES ('Jake', 'Jones', 1234, 1, 1, 'JJONES', 'PHX'), ('Jake', 'Jones', 1234, 1, 1, NULL, 'PHX'), ('Jake', 'Jones', 1234, 1, 1, NULL, NULL), ('Jane', 'Jones', 5678, 1, 1, 'JJONES2', NULL);

My query idea looks like this

 SELECT e.* 
  FROM employeemaster e
  JOIN ( SELECT firstname,
                lastname,
                employeenumber,
                companynumber,
                statusflag,
                MAX( LENGTH ( username || branch ) ) data_quality
           FROM  employeemaster
         GROUP BY firstname, lastname, employeenumber, companynumber, statusflag
        HAVING count(*) > 1
       ) g
    ON LENGTH ( username || branch ) = g.data_quality

Upvotes: 1

Ron Smith
Ron Smith

Reputation: 3266

You can use the sys.columns table to get a list of columns and build a dynamic query. This query will return a 'KeepThese' value for every record you want to keep based on your given criteria.

-- insert test data
create table EmployeeMaster
  (
    Record int identity(1,1),
    FirstName varchar(50),
    LastName varchar(50),
    EmployeeNumber int,
    CompanyNumber int,
    StatusFlag int,
    UserName varchar(50),
    Branch varchar(50)
  );
insert into EmployeeMaster
  (
    FirstName,
    LastName,
    EmployeeNumber,
    CompanyNumber,
    StatusFlag,
    UserName,
    Branch
  )
  values
    ('Jake','Jones',1234,1,1,'JJONES','PHX'),
    ('Jake','Jones',1234,1,1,NULL,'PHX'),
    ('Jake','Jones',1234,1,1,NULL,NULL),
    ('Jane','Jones',5678,1,1,'JJONES2',NULL);

-- get records with most non-null values with dynamic sys.column query
declare @sql varchar(max)
select @sql = '
    select e.*,
        row_number() over(partition by
                            e.FirstName,
                            e.LastName,
                            e.EmployeeNumber,
                            e.CompanyNumber,
                            e.StatusFlag
                          order by n.NonNullCnt desc) as KeepThese
    from EmployeeMaster e
        cross apply (select count(n.value) as NonNullCnt from (select ' +
            replace((
                select 'cast(' + c.name + ' as varchar(50)) as value union all select '
                from sys.columns c
                where c.object_id = t.object_id
                for xml path('')
                ) + '#',' union all select #','') + ')n)n'
from sys.tables t
where t.name = 'EmployeeMaster'

exec(@sql)

Upvotes: 3

Pரதீப்
Pரதீப்

Reputation: 93754

Try this.

;WITH cte
     AS (SELECT Row_number()
                  OVER(
                    partition BY firstname, lastname, employeenumber, companynumber, statusflag
                    ORDER BY (SELECT NULL)) rn,
                firstname,
                lastname,
                employeenumber,
                companynumber,
                statusflag,
                username,
                branch
         FROM   employeemaster),
     cte1
     AS (SELECT a.firstname,
                a.lastname,
                a.employeenumber,
                a.companynumber,
                a.statusflag,
                Row_number()
                  OVER(
                    partition BY a.firstname, a.lastname, a.employeenumber, a.companynumber, a.statusflag
                    ORDER BY (CASE WHEN a.username IS NULL THEN 1 ELSE 0 END +CASE WHEN a.branch IS NULL THEN 1 ELSE 0 END) )rn
                        -- add the remaining columns in case statement
         FROM   cte a
                JOIN employeemaster b
                  ON a.firstname = b.firstname
                     AND a.lastname = b.lastname
                     AND a.employeenumber = b.employeenumber
                     AND a.companynumbe = b.companynumber
                     AND a.statusflag = b.statusflag)
SELECT *
FROM   cte1
WHERE  rn = 1 

Upvotes: 1

Related Questions