Snæbjørn
Snæbjørn

Reputation: 10792

Transform raw data into relational data

Intro

I've been given a messy excel dump straight into a table. Now I need to turn that mess into something useful. The dump has duplicates and inconsistencies... good times!

I've been striking out on every approach so far :( - Hope you can help me out.

Given this example data set:

ExcelDump
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 |      |      | C    |
|  1 |      | B    | C    |
|  1 | A    | B    | D    |
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | A    | B    | C    |
|  4 | G    | B    | C    |
+----+------+------+------+

One possible result could be:

OutputTable
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  4 | A    | B    | C    |
+----+------+------+------+

Nice and neat. Unique ID key and data merged together in a way that makes sense.

How to choose which data is correct?

You've probably noticed that another possible result could be:

+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | G    | B    | C    |
+----+------+------+------+

This is where it gets complicated. I want to be able to choose the set that makes the most sense based on some conditions I can manipulate.

For instance I want to setup a condition that says: "Choose the most (non-null) common value, if no most common found take the first value found that is not null." This condition should be applied to the selection of grouped by IDs. The result of that condition would be:

+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  4 | A    | B    | C    |
+----+------+------+------+

If I later find out that that assumption was wrong and it instead should be: "Choose the most (non-null) common value, if no most common found take the last value found that is not null."

+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | G    | B    | C    |
+----+------+------+------+

So basically I want to select values based a set of conditions on each group of IDs.

Upvotes: 1

Views: 219

Answers (3)

Zhenny
Zhenny

Reputation: 831

I've modified my solution to take into account the extra information added in the question. The below query will get you the second sort priority you specified. In order to get the first one, you'd change the "max" in the outer apply to "min" and change the "sortOrder desc" to "sortOrder asc". Keep in mind if you have multiple ties for most frequent, say A,A,B,B,C and A came first, it would go with B in the below code because that was the highest count and came after the 2 A's.

-- setup test table
create table ExcelDump(
    id int
,   Col1 char(1)
,   Col2 char(1)
,   Col3 char(1)
)

insert into ExcelDump values(1,null,null,'C')
insert into ExcelDump values(1,null,'B','C')
insert into ExcelDump values(1,'A','B','D')
insert into ExcelDump values(1,'E','B','C')
insert into ExcelDump values(2,'A','B','C')
insert into ExcelDump values(2,'A','B','C')
insert into ExcelDump values(3,'A','B','C')
insert into ExcelDump values(3,'A','B','F')
insert into ExcelDump values(4,'A','B','C')
insert into ExcelDump values(4,'G','B','C')

-- create temp tables to make it easier to debug
select distinct
    id
into #distinct
from ExcelDump

-- number order isn't guaranteed but should be sorting them as first come first serve from the original table if no indexes exist
select
    row_number() over(order by (select 1)) as numberOrder
,   ID
,   Col1
,   Col2
,   Col3
into #sorted
from ExcelDump

-- actual query
select
    ui.Id
,   col1.Col1
,   col2.Col2
,   col3.Col3
from #distinct ui
  outer apply (
        select top 1
            ed.Col1
        ,   count(*) as cnt
        ,   max(ed.numberOrder) as sortOrder
        from #sorted ed
        where ed.id = ui.id
        and ed.Col1 is not null -- ignore nulls
        group by ed.Col1
        order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
    ) col1
  outer apply (
        select top 1
            ed.Col2
        ,   count(*) as cnt
        ,   max(ed.numberOrder) as sortOrder
        from #sorted ed
        where ed.id = ui.id
        and ed.Col2 is not null -- ignore nulls
        group by ed.Col2
        order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
    ) col2
  outer apply (
        select top 1
            ed.Col3
        ,   count(*) as cnt
        ,   max(ed.numberOrder) as sortOrder
        from #sorted ed
        where ed.id = ui.id
        and ed.Col3 is not null -- ignore nulls
        group by ed.Col3
        order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
    ) col3

Upvotes: 1

npclaudiu
npclaudiu

Reputation: 2441

You can also use cursors to iterate through your temporary ExcelDump table to filter each row. You could store the filtered results into another temporary table that can have its own constraints like unique or not null if necessary and, by using cursors, you can write specialized code to handle every validation you require.

Upvotes: 0

JNK
JNK

Reputation: 65157

As written, you can do this with a simple GROUP BY:

SELECT 
    id, 
    Col1 = MAX(Col1),
    Col2 = MAX(Col2),
    Col3 = MAX(Col3)
FROM
   ExcelDump
GROUP BY
   id

This pattern will give you the highest non-null value per column per id value.

Upvotes: 3

Related Questions