Reputation: 21840
I'm trying to write a query that will produce a very small sample of data from each column of a table, in which the sample is made up of the top 3 most common values. This particular problem is part of a bigger task, which is to write scripts that can characterize a database and its tables, its data integrity, and also quickly survey common values in the table on a per-column basis. Think of this as an automated "analysis" of a table.
On a single column basis, I do this already by simply calculating the frequency of values and then sorting by frequency. If I had a column called "color" and all colors were in it, and it just so happened that the color "blue" was in most rows, then the top 1 most frequently occurring value would be "blue". In SQL that is easy to calculate.
However, I'm not sure how I would do this over multiple columns.
Currently, when I do a calculation over all columns of a table, I perform the following type of query:
USE database;
DECLARE @t nvarchar(max)
SET @t = N'SELECT '
SELECT @t = @t + 'count(DISTINCT CAST(' + c.name + ' as varchar(max))) "' + c.name + '",'
FROM sys.columns c
WHERE c.object_id = object_id('table');
SET @t = SUBSTRING(@t, 1, LEN(@t) - 1) + ' FROM table;'
EXEC sp_executesql @t
However, its not entirely clear to me how I would do that here.
(Sidenote:columns that are of type text, ntext, and image, since those would cause errors while counting distinct values, but i'm less concerned about solving that)
But the problem of getting top three most frequent values per column has got me absolutely stumped.
Ideally, I'd like to end up with something like this:
Col1 Col2 Col3 Col4 Col5
---------------------------------------------------------------------
1,2,3 red,blue,green 29,17,0 c,d,j nevada,california,utah
Upvotes: 3
Views: 812
Reputation: 2565
I hacked this together, but it seems to work:
I cant help but think I should be using RANK().
USE <DB>;
DECLARE @query nvarchar(max)
DECLARE @column nvarchar(max)
DECLARE @table nvarchar(max)
DECLARE @i INT = 1
DECLARE @maxi INT = 10
DECLARE @target NVARCHAR(MAX) = <table>
declare @stage TABLE (i int IDENTITY(1,1), col nvarchar(max), tbl nvarchar(max))
declare @results table (ColumnName nvarchar(max), ColumnValue nvarchar(max), ColumnCount int, TableName NVARCHAR(MAX))
insert into @stage
select c.name, o.name
from sys.columns c
join sys.objects o on o.object_id=c.object_id and o.type = 'u'
and c.system_type_id IN (select system_type_id from sys.types where [name] not in ('text','ntext','image'))
and o.name like @target
SET @maxi = (select max(i) from @stage)
while @i <= @maxi
BEGIN
set @column = (select col from @stage where i = @i)
set @table = (select tbl from @stage where i = @i)
SET @query = N'SELECT ' +''''+@column+''''+' , '+ @column
SELECT @query = @query + ', COUNT( ' + @column + ' ) as count' + @column + ' , ''' + @table + ''' as tablename'
select @query = @query + ' from ' + @table + ' group by ' + @column
--Select @query
insert into @results
EXEC sp_executesql @query
SET @i = @i + 1
END
select * from @results
; with cte as (
select *, ROW_NUMBER() over (partition by Columnname order by ColumnCount desc) as rn from @results
)
select * from cte where rn <=3
Upvotes: 1
Reputation: 48058
Start with this SQL Statement builder, and modify it to suit your liking:
EDIT Added Order by Desc
With ColumnSet As
(
Select TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
From INFORMATION_SCHEMA.COLUMNS
Where 1=1
And TABLE_NAME IN ('Table1')
And COLUMN_NAME IN ('Column1', 'Column2')
)
Select 'Select Top 3 ' + COLUMN_NAME + ', Count (*) NumInstances From ' + TABLE_SCHEMA + '.'+ TABLE_NAME + ' Group By ' + COLUMN_NAME + ' Order by Count (*) Desc'
From ColumnSet
Upvotes: 0