Oleg Ivanytskyi
Oleg Ivanytskyi

Reputation: 1081

How to find the most common value in each column

I have a table that looks like this:

category    name1    name2    name3    name4    name5

  1         John      Sam     John     Katy      Cat
  1         John      Ivan    Bob      Andrew    Tom
  1         Sam       Ivan    George   Bob       Tom
  2         Jack      Siri    Elsa     Noah      Anna
  2         Jack      Bob     Tomas    Noah      Tom

What I need to do, is to find the most common value in each column for each category. That is, I need the following result:

category    name1    name2    name3    name4    name5

  1         John      Ivan    John     Katy      Tom
  2         Jack      Siri    Elsa     Noah      Anna

If there are several values with the same frequency, any of them may be chosen.

So far, I have only managed to do that for one column with this script:

SELECT top(1) category, name1, COUNT(name1) AS freq
FROM data
GROUP BY category, name1
ORDER BY freq DESC

But how do I do that for several columns in SQL Server?

Upvotes: 0

Views: 147

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1269873

If you don't mind the results in rows, you can unpivot, which makes this much simpler:

select category, which, name
from (select t.category, v.which, v.name, count(*) as cnt,
             row_number() over (partition by t.category, v.which order by count(*) desc) as seqnum
      from t cross apply
           (values (1, name1), (2, name2), (3, name3), (4, name4), (5, name4)
           ) v(which, name)
      from t
      group by t.category, v.which, v.name
      ) cwn
where seqnum = 1;

You can re-pivot if you want this in columns:

with cwn as (
      select t.category, v.which, v.name, count(*) as cnt,
             row_number() over (partition by t.category, v.which order by count(*) desc) as seqnum
      from t cross apply
           (values (1, name1), (2, name2), (3, name3), (4, name4), (5, name4)
           ) v(which, name)
      from t
      group by t.category, v.which, v.name
     )
select category,
       max(case when which = 1 then name end) as name1,
       max(case when which = 2 then name end) as name2,
       max(case when which = 3 then name end) as name3,
       max(case when which = 4 then name end) as name4,
       max(case when which = 5 then name end) as name5
from cwn
where seqnum = 1
group by category

Upvotes: 1

MatBailie
MatBailie

Reputation: 86735

One option, with lots of repetition, but accommodating your current structure...

(Though it still has the same non-deterministic/arbitrary behaviour when dealing with equally frequent names)

WITH
  counted AS
(
  SELECT
    category,
    name1,
    COUNT(*) OVER (PARTITION BY category, name1)  AS name1_freq,
    name2,
    COUNT(*) OVER (PARTITION BY category, name2)  AS name2_freq
  FROM
    yourTable
),
  ranked AS
(
  SELECT
    category,
    name1,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY name1_freq DESC)  AS name1_rank,
    name2,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY name2_freq DESC)  AS name2_rank
  FROM
    counted
)
SELECT
  category,
  MAX(CASE WHEN name1_rank = 1 THEN name1 END)  AS name1_most_common,
  MAX(CASE WHEN name2_rank = 1 THEN name2 END)  AS name2_most_common
FROM
  ranked
GROUP BY
  category

As you can see, that's a LOT of repetition. Which is why SQL is predicated on normalised data-structures. So much so that normalising then denormalising your structure may be a valid option...

WITH
  normalised(category, col, name) AS
(
            SELECT category, 1, name1 FROM yourTable 
  UNION ALL SELECT category, 2, name2 FROM yourTable
),
  counted AS
(
  SELECT
    category, col, name, COUNT(*) AS freq
  FROM
    normalised
  GROUP BY
    category, col, name
),
  ranked AS
(
  SELECT
    category, col, name,
    ROW_NUMBER() OVER (PARTITION BY category, col ORDER BY freq DESC)  AS rank
  FROM
    counted
)
SELECT
  category,
  MAX(CASE WHEN col = 1 THEN name END)  AS name1_most_common,
  MAX(CASE WHEN col = 2 THEN name END)  AS name2_most_common
FROM
  ranked
WHERE
  rank = 1
GROUP BY
  category

Upvotes: 1

forpas
forpas

Reputation: 164099

First create a CTE that uses COUNT() window function to return the number of times each name occurs in each category and then use FIRST_VALUE() window function to get for each column the name that occurs the most:

WITH cte AS (
  SELECT *,
         COUNT(*) OVER (PARTITION BY category, name1) count1,
         COUNT(*) OVER (PARTITION BY category, name2) count2,
         COUNT(*) OVER (PARTITION BY category, name3) count3,
         COUNT(*) OVER (PARTITION BY category, name4) count4,
         COUNT(*) OVER (PARTITION BY category, name5) count5
  FROM tablename
)
SELECT DISTINCT category,
       FIRST_VALUE(name1) OVER (PARTITION BY category ORDER BY count1 DESC) name1,
       FIRST_VALUE(name2) OVER (PARTITION BY category ORDER BY count2 DESC) name2,
       FIRST_VALUE(name3) OVER (PARTITION BY category ORDER BY count3 DESC) name3,
       FIRST_VALUE(name4) OVER (PARTITION BY category ORDER BY count4 DESC) name4,
       FIRST_VALUE(name5) OVER (PARTITION BY category ORDER BY count5 DESC) name5
FROM cte

See the demo.

Upvotes: 4

Related Questions