Reputation: 431

Filtering SELECT statement by row with most populated columns

I have a statement that SELECTs distinct rows from a table with a two value index

SELECT distinct Reference, Value1, Value2, Value3, Value4 FROM [tblHistory]

Where Reference is an index with another field "Project". For a particular system this data is inserted into another table using only Reference as the index because Value1 through Value4 SHOULD always be the same for the same Reference - however in about 1/500 it is not.

In the case where there is a duplicate Reference AND differences in one or more of the Value1-Value4 fields I need to pick the row with the most completed Value1-Value4 fields as they are often NULL. If all instances have the same number of populated columnsI can return the first row found.

Other than using temporary tables and code like

case when Value1 is null then 1 else 0 end 
+ case when Value2 is null then 1 else 0 end 
+ case when Value3 is null then 1 else 0 end
+ case when Value4 is null then 1 else 0 end
as CountOfNulls

Is there a way to filter the data so I only get the most populated row?

I'm running MS SQL Server 2000.

Upvotes: 3

Answers (4)

PCurd

Reputation: 431

Well it's not a great way to do it, but I have put together this code and it works:

SELECT distinct Reference, Value1, Value2, Value3, Value4

FROM [tblHistory]
WHERE Reference+cast(4-(case when Value1 is null then 1 else 0 end 
+ case when Value2 is null then 1 else 0 end 
+ case when Value3 is null then 1 else 0 end
+ case when Value4 is null then 1 else 0 END) AS varchar) IN (

SELECT myref + CAST(MAX(CountOfNonNulls) AS VARCHAR) FROM
(

SELECT myref, 4-(case when Value1 is null then 1 else 0 end 
+ case when Value2 is null then 1 else 0 end 
+ case when Value3 is null then 1 else 0 end
+ case when Value4 is null then 1 else 0 end)
as CountOfNonNulls

FROM [tblHistory]
)l
GROUP BY Reference
)

I don't actually have this table structure so I haven't tested it but it seems to work. The idea is to make a "new" key by adding the highest CountOfNonNulls to the Reference field and using that to limit the select - it means the nasty CASE code is run twice but the other filters I have (not shown) limit the population to roughly 80 rows in my system so I can live with this.

I have yet to see what it would do if there were two rows with the same CountOfNonNulls value but different Value1-Value4 fields - I think it would break. In that case I would probably add the Value1-Value4 fields to my "new" key but that's a bit silly.

Any suggestions of improvement would be much appreciated!

Upvotes: 0

Jeff Martin

Reputation: 11022

You could look into the Coalesce function but honestly I would probably do the Case statement as you have above.

What reason would you not want to use it?

According to the comments, the table is more than just 4 value fields. But the fear is that a temporary table keeping the highest count of nulls will be necessary.

I feel that the Case solution, potentially implemented into a view is still a viable and good solution.

Upvotes: 1

Irawan Soetomo

Reputation: 1325


-- count() will not include NULL, so we can avoid making complex conditions
;
with
sum_cnt
(
    Reference,
    cnt
)
as
(
    select 
        Reference, 
        count(Value1) + count(Value2) + count(Value3) + count(Value4) 
    from 
        tblHistory 
    group by 
        Reference
)
select top 1
    Reference
from
    sum_cnt 
order by
    cnt desc

Upvotes: 1

Bork Blatt

Reputation: 3368

This is just a hunch, since I haven't seen your database, but it looks like this is one table that should actually be two. As I said, this may have been done for performance or other legit reasons, but the way you need to report over it suggests this table should be split.

Nonetheless, if I had no other alternative, I'd create a column with the count of filled in columns for the row, and programmatically update it whenever the particular record is updated.

Upvotes: 1

Filtering SELECT statement by row with most populated columns

Answers (4)

Related Questions