Kristian
Kristian

Reputation: 21840

Get top three most common values from every column in a table

I'm trying to write a query that will produce a very small sample of data from each column of a table, in which the sample is made up of the top 3 most common values. This particular problem is part of a bigger task, which is to write scripts that can characterize a database and its tables, its data integrity, and also quickly survey common values in the table on a per-column basis. Think of this as an automated "analysis" of a table.

On a single column basis, I do this already by simply calculating the frequency of values and then sorting by frequency. If I had a column called "color" and all colors were in it, and it just so happened that the color "blue" was in most rows, then the top 1 most frequently occurring value would be "blue". In SQL that is easy to calculate.

However, I'm not sure how I would do this over multiple columns.

Currently, when I do a calculation over all columns of a table, I perform the following type of query:

USE database;

DECLARE @t nvarchar(max)
SET @t = N'SELECT '

SELECT @t = @t + 'count(DISTINCT CAST(' + c.name + ' as varchar(max))) "' + c.name + '",'
FROM sys.columns c 
WHERE c.object_id = object_id('table');

SET @t = SUBSTRING(@t, 1, LEN(@t) - 1) + ' FROM table;'

EXEC sp_executesql @t

However, its not entirely clear to me how I would do that here.

(Sidenote:columns that are of type text, ntext, and image, since those would cause errors while counting distinct values, but i'm less concerned about solving that)

But the problem of getting top three most frequent values per column has got me absolutely stumped.

Ideally, I'd like to end up with something like this:

Col1     Col2              Col3       Col4     Col5
---------------------------------------------------------------------
1,2,3    red,blue,green    29,17,0    c,d,j    nevada,california,utah

Upvotes: 3

Views: 812

Answers (2)

Eric Hauenstein
Eric Hauenstein

Reputation: 2565

I hacked this together, but it seems to work:

I cant help but think I should be using RANK().

USE <DB>;

DECLARE @query nvarchar(max)
DECLARE @column nvarchar(max)
DECLARE @table nvarchar(max)
DECLARE @i INT = 1
DECLARE @maxi INT = 10
DECLARE @target NVARCHAR(MAX) = <table>

declare @stage TABLE (i int IDENTITY(1,1), col nvarchar(max), tbl nvarchar(max))
declare @results table (ColumnName nvarchar(max), ColumnValue nvarchar(max), ColumnCount int, TableName NVARCHAR(MAX))

insert into @stage

select c.name, o.name
    from sys.columns c
    join sys.objects o on o.object_id=c.object_id and o.type = 'u'
    and c.system_type_id IN (select system_type_id from sys.types where [name] not in ('text','ntext','image'))
    and o.name like @target



SET @maxi = (select max(i) from @stage)

while @i <= @maxi

BEGIN

set @column = (select col from @stage where i = @i)
set @table = (select tbl from @stage where i = @i)


SET @query = N'SELECT ' +''''+@column+''''+' , '+ @column

SELECT @query = @query + ', COUNT(  ' + @column + ' ) as count' + @column + ' , ''' + @table + ''' as tablename'
select @query = @query + ' from ' + @table + ' group by ' + @column

--Select @query
insert into @results
EXEC sp_executesql @query

SET @i = @i + 1
END

select * from @results
; with cte as (
                select *, ROW_NUMBER() over (partition by Columnname order by ColumnCount desc) as rn from @results
                )


select * from cte where rn <=3

Upvotes: 1

Raj More
Raj More

Reputation: 48058

Start with this SQL Statement builder, and modify it to suit your liking:

EDIT Added Order by Desc

With ColumnSet As
(
    Select TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
    From INFORMATION_SCHEMA.COLUMNS
    Where 1=1
        And TABLE_NAME IN ('Table1')
        And COLUMN_NAME IN ('Column1', 'Column2')
)
Select 'Select Top 3 ' + COLUMN_NAME + ', Count (*) NumInstances From ' + TABLE_SCHEMA + '.'+ TABLE_NAME + ' Group By ' + COLUMN_NAME + ' Order by Count (*) Desc'
From ColumnSet

Upvotes: 0

Related Questions