Alistair Knock
Alistair Knock

Reputation: 1836

Select most frequently occurring records using two or more grouping columns

I have a table which has no primary key and I can't add one - the relevant columns from it are:

Department   | Category  | 
-------------+-----------+
0001         | A         |
0002         | D         |
0003         | A         | 
0003         | A         |
0003         | C         |
0004         | B         |

I want to retrieve a single row for each Department, which gives me the department code and the Category which appears most frequently in the table, i.e.

Department   | Category  | 
-------------+-----------+
0001         | A         |
0002         | D         |
0003         | A         | 
0004         | B         |

What is the best way to achieve this? My current attempt involves a Count(Category) in a subquery from which the Max(CountofCategory) is then taken, but including the Category field at this stage means too many rows at returned (since GROUP BY is applied at Category level as well as Department). In the case of a tie, I'd just select the min/max of the category arbitrarily. Ideally this should be database-agnostic, but is likely to run on either Oracle or MySQL.

Upvotes: 1

Views: 6691

Answers (4)

Linda Pulickal
Linda Pulickal

Reputation: 21

You can try the following as well. Window here returns the Categories ordered by their descending frequency of match against each Department. FIRST_VALUE() picks the first one from this.

SELECT DISTINCT (department), 
  FIRST_VALUE(category) OVER
    (PARTITION BY department ORDER BY count(*) DESC ROWS UNBOUNDED PRECEDING)
FROM T
GROUP BY department, category;

Upvotes: 1

Merlin
Merlin

Reputation: 1

There's an easier way:

select department, stats_mode(category) from T ;

works great when only the most frequent value is needed, when you need 2nd, 3rd... most frequent you have to do the counting like above.

Upvotes: 0

Shannon Severance
Shannon Severance

Reputation: 18410

Works in both Oracle and SQL Server, I believe is all standard SQL, from later standards:

with T_with_RN as
    (select Department
        , Category
        , row_number() over (partition by Department order by count(*) Desc) as RN
    from T
    group by Department, Category)
select Department, Category
from T_with_RN
where RN = 1

EDIT I don't know why I used the WITH, the solution is probably easier to read using an inline view:

select Department, Category
from (select Department
    , Category
    , row_number() over (partition by Department order by count(*) Desc) as RN
    from T
    group by Department, Category) T_with_RN
where RN = 1

END EDIT

Test cases:

create table T (
    Department varchar(10) null,
    Category varchar(10) null
);

-- Original test case
insert into T values ('0001', 'A');
insert into T values ('0002', 'D');
insert into T values ('0003', 'A');
insert into T values ('0003', 'A');
insert into T values ('0003', 'C');
insert into T values ('0004', 'B');
-- Null Test cases:
insert into T values (null, 'A');
insert into T values (null, 'B');
insert into T values (null, 'B');
insert into T values ('0005', null);
insert into T values ('0005', null);
insert into T values ('0005', 'X');
-- Tie Test case
insert into T values ('0006', 'O');
insert into T values ('0006', 'P');

Upvotes: 3

Nicole
Nicole

Reputation: 33197

You'll have to clean this up if you are better with subqueries than I am, but in my testing this produced the result you want:

SELECT
  main.Department as Department,
  (SELECT 
     Category
   FROM yourtable
   WHERE Department=main.Department
   GROUP BY Category
   ORDER BY COUNT(Category) DESC
   LIMIT 1) AS Category
FROM yourtable main
GROUP BY main.Department

The trick is just to get the one row in the subquery to return the max value you want with the ORDER BY and the "LIMIT 1"

Upvotes: 0

Related Questions