KnowHoper
KnowHoper

Reputation: 4622

Smart SQL Merge - n rows, coalesce

Can anyone help with a SQL problem Im having whereby I need to merge n number of rows into one record. The individual records may or may not have fields populated that others do.

Basically I have an issue where duplicate records have been created in SQL. Some contain information others dont. I need to merge them (I can rank them), updating a field if the value doesn't exist in the former record (starting with highest ranked first).

For instance, if I have two user records, one has last name populated, the other has first name. These are duplicates and need to be merged into one record, like a coalesce. However, the are n number of rows.

Its essentially a transpose of many records into one, where a field is updated only if a duplicate record lower in rank has that field populated and the field doesn't exist in the higher ranked row.

Here is a very simplified version of the issue. As you can see, using SQL Fiddle the script creates 6 records. These records should be merged into 2 records and all the fields filled in.

The problem is, there could be x number of rows. I cannot use a coalesce statement as there is variance to the number of rows.

Hope that make sense?

CREATE TABLE [dbo].[Employee]([EmployeeId] varchar(10) NULL,
[First Name] [varchar](30) NULL,
[Middle Name] [varchar](30) NOT NULL,
[Last Name] [varchar](30) NOT NULL,
[E-Mail] [varchar](80) NOT NULL)


insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('BOB1','Bob','','','[email protected]');

insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('BOB1','','John','','[email protected]');

insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('BOB1','','','Smith','[email protected]');

insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('MARK1','','Peter','','[email protected]');

insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('MARK1','Mark','','','[email protected]');

insert into Employee(EmployeeId,[First Name],[Middle Name],[Last Name],[E-Mail])
values('MARK1','','','Davis','[email protected]');


select * from [Employee]

Hope that makes sense.

Thanks

Upvotes: 1

Views: 1519

Answers (3)

Krzysztof Kozielczyk
Krzysztof Kozielczyk

Reputation: 5937

If the performance is important enough to justify a couple of hours of coding and you are allowed to use SQLCLR, you can calculate all values in a single table scan with multi-parameter User Defined Aggregare.

Here's an example of an aggregate that returns lowest-ranked non-NULL string:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;
using Microsoft.SqlServer.Server;

[Serializable]
[SqlUserDefinedAggregate(Format.UserDefined, MaxByteSize = -1, IsNullIfEmpty = true)]
public struct LowestRankString : IBinarySerialize
{
    public int currentRank;
    public SqlString currentValue;

    public void Init()
    {
        currentRank = int.MaxValue;
        currentValue = SqlString.Null;
    }

    public void Accumulate(int Rank, SqlString Value)
    {
        if (!Value.IsNull)
        {
            if (Rank <= currentRank)
            {
                currentRank = Rank;
                currentValue = Value;
            }
        }
    }

    public void Merge(LowestRankString Group)
    {
        Accumulate(Group.currentRank, Group.currentValue);
    }

    public SqlString Terminate()
    {
        return currentValue;
    }

    public void Read(BinaryReader r)
    {
        currentRank = r.ReadInt32();
        bool hasValue = r.ReadBoolean();
        if (hasValue)
        {
            currentValue = new SqlString(r.ReadString());
        }
        else
        {
            currentValue = SqlString.Null;
        }
    }

    public void Write(BinaryWriter w)
    {
        w.Write(currentRank);

        bool hasValue = !currentValue.IsNull;
        w.Write(hasValue);
        if (hasValue)
        {
            w.Write(currentValue.Value);
        }
    }
}

Assuming your table looks something like this:

CREATE TABLE TopNonNullRank ( Id INT NOT NULL, UserId NVARCHAR (32) NOT NULL, Value1 NVARCHAR (128) NULL, Value2 NVARCHAR (128) NULL, Value3 NVARCHAR (128) NULL, Value4 NVARCHAR (128) NULL, PRIMARY KEY CLUSTERED (Id ASC) );

INSERT INTO TopNonNullRank (Id, UserId, Value1, Value2, Value3, Value4) VALUES 
    (1, N'Ada', NULL, N'Top value 2 for A', N'Top value 3 for A', NULL),
    (2, N'Ada', N'Top value 1 for A', NULL, N'Other value 3', N'Top value 4 for A'),
    (3, N'Ada', N'Other value 1 for A', N'Other value 2 for A', N'Other value 3 for A', NULL),
    (4, N'Bob', N'Top value 1 for B', NULL, NULL, NULL),
    (5, N'Bob', NULL, NULL, NULL, N'Top value 4 for B'),
    (6, N'Bob', N'Other value 1 for B', N'Top value 2 for B', NULL, N'Other value 4');

The following simple query returns top non-NULL value for each column.

SELECT 
    UserId,
    dbo.LowestRankString(Id, Value1) AS TopValue1,
    dbo.LowestRankString(Id, Value2) AS TopValue2,
    dbo.LowestRankString(Id, Value3) AS TopValue3,
    dbo.LowestRankString(Id, Value4) AS TopValue4
FROM TopNonNullRank
GROUP BY UserId

The only thing left is merging the results back to the original table. The simplest way would be something like this:

WITH TopValuesPerUser AS
(
    SELECT 
        UserId,
        dbo.LowestRankString(Id, Value1) AS TopValue1,
        dbo.LowestRankString(Id, Value2) AS TopValue2,
        dbo.LowestRankString(Id, Value3) AS TopValue3,
        dbo.LowestRankString(Id, Value4) AS TopValue4
    FROM TopNonNullRank
    GROUP BY UserId
)
UPDATE TopNonNullRank
SET
    Value1 = TopValue1,
    Value2 = TopValue2,
    Value3 = TopValue3,
    Value4 = TopValue4
FROM TopNonNullRank AS OriginalTable
LEFT JOIN TopValuesPerUser ON TopValuesPerUser.UserId = OriginalTable.UserId;

Note that this update still leaves you with duplicate rows, and you would need to get rid of them.

You could also get more fancy and store the results of this query into a temporary table, and then use MERGE statement to apply them to the original table.

Another option would be to store the results in a new table, and then swap it with the original table using sp_rename stored proc.

Upvotes: 1

David Archer
David Archer

Reputation: 2121

This should work to set each row's value for a particular field to the highest ranked, non-NULL value for that group of duplicate rows.

GroupingKey is either a field or some logic for how you're determining a set of rows that need to be merged into one. Ranking is your ranking function.

UPDATE SomeTable SET
    Field1 =
    (
        SELECT TOP 1 t2.Field1
        FROM SomeTable t2
        WHERE
            t2.Field1 IS NOT NULL
            AND
            t2.GroupingKey = SomeTable.GroupingKey
        ORDER BY Ranking DESC
    )
FROM SomeTable

You can repeat this for every field you need to update or just update the query to set them all, just copy-paste. Once you've updated all the fields, the only thing that remains is to remove the duplicate records and you can use the same GroupingKey and Ranking to remove all but the highest ranked row for a particular GroupingKey.

DELETE SomeTable
FROM SomeTable
INNER JOIN
(
    SELECT 
        GroupingKey, 
        MAX(Ranking) as MaxRanking
    FROM SomeTable
) t2 ON
    SomeTable.GroupingKey = t2.GroupingKey
    AND
    SomeTable.Ranking < t2.MaxRanking

Upvotes: 0

whytheq
whytheq

Reputation: 35587

Without an example of the data you are dealing with it is difficult to precisely interpret this question. You could create a live example in here for us to play with.

So without an example, if I assume there are numeric fields [N] and [M] in two tables myTable1 and myTable2 then why not just use COALESCE with a FULL OUTER JOIN..

SELECT
  fieldx = COALESCE(a.fieldx, b.fieldx),
  fieldy = COALESCE(a.fieldy, b.fieldy),
  fieldz = COALESCE(a.fieldz, b.fieldz),
  [N]=SUM(ISNULL(a.[N],0.0)),
  [M]=SUM(ISNULL(b.[M],0.0))
FROM
  myTable1 a
  FULL OUTER JOIN myTable2 b
     ON a.fieldx = b.fieldx 
        a.fieldy = b.fieldy 
        a.fieldz = b.fieldz 
GROUP BY
   COALESCE(a.fieldx, b.fieldx),
   COALESCE(a.fieldy, b.fieldy),
   COALESCE(a.fieldz, b.fieldz)

Upvotes: 0

Related Questions