Reputation: 10792
I've been given a messy excel dump straight into a table. Now I need to turn that mess into something useful. The dump has duplicates and inconsistencies... good times!
I've been striking out on every approach so far :( - Hope you can help me out.
Given this example data set:
ExcelDump
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
| 1 | | | C |
| 1 | | B | C |
| 1 | A | B | D |
| 1 | E | B | C |
| 2 | A | B | C |
| 2 | A | B | C |
| 3 | A | B | C |
| 3 | A | B | F |
| 4 | A | B | C |
| 4 | G | B | C |
+----+------+------+------+
One possible result could be:
OutputTable
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
| 1 | A | B | C |
| 2 | A | B | C |
| 3 | A | B | C |
| 4 | A | B | C |
+----+------+------+------+
Nice and neat. Unique ID key and data merged together in a way that makes sense.
You've probably noticed that another possible result could be:
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
| 1 | E | B | C |
| 2 | A | B | C |
| 3 | A | B | F |
| 4 | G | B | C |
+----+------+------+------+
This is where it gets complicated. I want to be able to choose the set that makes the most sense based on some conditions I can manipulate.
For instance I want to setup a condition that says: "Choose the most (non-null) common value, if no most common found take the first value found that is not null." This condition should be applied to the selection of grouped by IDs. The result of that condition would be:
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
| 1 | A | B | C |
| 2 | A | B | C |
| 3 | A | B | C |
| 4 | A | B | C |
+----+------+------+------+
If I later find out that that assumption was wrong and it instead should be: "Choose the most (non-null) common value, if no most common found take the last value found that is not null."
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
| 1 | E | B | C |
| 2 | A | B | C |
| 3 | A | B | F |
| 4 | G | B | C |
+----+------+------+------+
So basically I want to select values based a set of conditions on each group of IDs.
Upvotes: 1
Views: 219
Reputation: 831
I've modified my solution to take into account the extra information added in the question. The below query will get you the second sort priority you specified. In order to get the first one, you'd change the "max" in the outer apply to "min" and change the "sortOrder desc" to "sortOrder asc". Keep in mind if you have multiple ties for most frequent, say A,A,B,B,C and A came first, it would go with B in the below code because that was the highest count and came after the 2 A's.
-- setup test table
create table ExcelDump(
id int
, Col1 char(1)
, Col2 char(1)
, Col3 char(1)
)
insert into ExcelDump values(1,null,null,'C')
insert into ExcelDump values(1,null,'B','C')
insert into ExcelDump values(1,'A','B','D')
insert into ExcelDump values(1,'E','B','C')
insert into ExcelDump values(2,'A','B','C')
insert into ExcelDump values(2,'A','B','C')
insert into ExcelDump values(3,'A','B','C')
insert into ExcelDump values(3,'A','B','F')
insert into ExcelDump values(4,'A','B','C')
insert into ExcelDump values(4,'G','B','C')
-- create temp tables to make it easier to debug
select distinct
id
into #distinct
from ExcelDump
-- number order isn't guaranteed but should be sorting them as first come first serve from the original table if no indexes exist
select
row_number() over(order by (select 1)) as numberOrder
, ID
, Col1
, Col2
, Col3
into #sorted
from ExcelDump
-- actual query
select
ui.Id
, col1.Col1
, col2.Col2
, col3.Col3
from #distinct ui
outer apply (
select top 1
ed.Col1
, count(*) as cnt
, max(ed.numberOrder) as sortOrder
from #sorted ed
where ed.id = ui.id
and ed.Col1 is not null -- ignore nulls
group by ed.Col1
order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
) col1
outer apply (
select top 1
ed.Col2
, count(*) as cnt
, max(ed.numberOrder) as sortOrder
from #sorted ed
where ed.id = ui.id
and ed.Col2 is not null -- ignore nulls
group by ed.Col2
order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
) col2
outer apply (
select top 1
ed.Col3
, count(*) as cnt
, max(ed.numberOrder) as sortOrder
from #sorted ed
where ed.id = ui.id
and ed.Col3 is not null -- ignore nulls
group by ed.Col3
order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
) col3
Upvotes: 1
Reputation: 2441
You can also use cursors to iterate through your temporary ExcelDump table to filter each row. You could store the filtered results into another temporary table that can have its own constraints like unique or not null if necessary and, by using cursors, you can write specialized code to handle every validation you require.
Upvotes: 0
Reputation: 65157
As written, you can do this with a simple GROUP BY
:
SELECT
id,
Col1 = MAX(Col1),
Col2 = MAX(Col2),
Col3 = MAX(Col3)
FROM
ExcelDump
GROUP BY
id
This pattern will give you the highest non-null value per column per id value.
Upvotes: 3