Reputation: 13029
How do I get the most frequently occurring category for each tag in MySQL? Ideally, I would want to simulate an aggregate function that would calculate the mode of a column.
SELECT
t.tag
, s.category
FROM tags t
LEFT JOIN stuff s
USING (id)
ORDER BY tag;
+------------------+----------+
| tag | category |
+------------------+----------+
| automotive | 8 |
| ba | 8 |
| bamboo | 8 |
| bamboo | 8 |
| bamboo | 8 |
| bamboo | 8 |
| bamboo | 8 |
| bamboo | 10 |
| bamboo | 8 |
| bamboo | 9 |
| bamboo | 8 |
| bamboo | 10 |
| bamboo | 8 |
| bamboo | 9 |
| bamboo | 8 |
| banana tree | 8 |
| banana tree | 8 |
| banana tree | 8 |
| banana tree | 8 |
| bath | 9 |
+------------------+----------+
Upvotes: 9
Views: 7772
Reputation: 26953
Another way to approximate this is to pull back as many of the categories as you can:
SELECT
t.tag
, group_concat(s.category SEPARATOR '%%%') AS category_selection
FROM tags t
LEFT JOIN stuff s
USING (id)
GROUP BY tag
ORDER BY tag
;
Then in your application code, count up the most frequent items and make your selection based on that:
foreach ($output as &$row) {
$category_counts = array();
$most_frequent_category = false;
foreach (explode('%%%', $row['category_selection']) as $category) {
if (!isset($category_counts[$category])) {
$category_counts[$category] = 1;
} else {
$category_counts[$category] += 1;
}
if ($most_frequent_category === false || $category_counts[$category] > $category_counts[$most_frequent_category]) {
$most_frequent_category = $category;
}
}
$row['category'] = $most_frequent_category;
unset($row['category_selection']);
}
CAVEAT: MariaDB (like MySQL) caps the length of a group_concat result at group_concat_max_len and truncates anything longer, so very large groups can skew the counts: https://mariadb.com/kb/en/server-system-variables/#group_concat_max_len
Upvotes: 0
Reputation: 26953
Here's a hacky approach that uses the max aggregate function, seeing as MySQL has no mode aggregate function (and, before version 8.0, no window functions) that would allow this:
SELECT
tag,
convert(substring(max(concat(lpad(c, 20, '0'), category)), 21), signed)
AS most_frequent_category
FROM (
SELECT tag, category, count(*) AS c
FROM tags INNER JOIN stuff using (id)
GROUP BY tag, category
) as grouped_cats
GROUP BY tag;
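Basically, it relies on the fact that zero-padded counts sort lexically in the same order as they sort numerically, so taking the max of the padded count concatenated with the category finds the highest count for each tag and carries its category along.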
This is easier to see with named categories:
create temporary table tags (id int auto_increment primary key, tag character varying(20));
create temporary table stuff (id int, category character varying(20));
insert into tags (tag) values ('automotive'), ('ba'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('banana tree'), ('banana tree'), ('banana tree'), ('banana tree'), ('bath');
insert into stuff (id, category) values (1, 'cat-8'), (2, 'cat-8'), (3, 'cat-8'), (4, 'cat-8'), (5, 'cat-8'), (6, 'cat-8'), (7, 'cat-8'), (8, 'cat-10'), (9, 'cat-8'), (10, 'cat-9'), (11, 'cat-8'), (12, 'cat-10'), (13, 'cat-8'), (14, 'cat-9'), (15, 'cat-8'), (16, 'cat-8'), (17, 'cat-8'), (18, 'cat-8'), (19, 'cat-8'), (20, 'cat-9');
In which case we shouldn't be doing integer conversion on the most_frequent_category column:
SELECT
tag,
substring(max(concat(lpad(c, 20, '0'), category)), 21) AS most_frequent_category
FROM (
SELECT tag, category, count(*) AS c
FROM tags INNER JOIN stuff using (id)
GROUP BY tag, category
) as grouped_cats
GROUP BY tag;
+-------------+------------------------+
| tag | most_frequent_category |
+-------------+------------------------+
| automotive | cat-8 |
| ba | cat-8 |
| bamboo | cat-8 |
| banana tree | cat-8 |
| bath | cat-9 |
+-------------+------------------------+
And to delve a little bit more into what is going on, here's what the grouped_cats inner select looks like (I've added order by tag, c desc):
+-------------+----------+---+
| tag | category | c |
+-------------+----------+---+
| automotive | cat-8 | 1 |
| ba | cat-8 | 1 |
| bamboo | cat-8 | 9 |
| bamboo | cat-10 | 2 |
| bamboo | cat-9 | 2 |
| banana tree | cat-8 | 4 |
| bath | cat-9 | 1 |
+-------------+----------+---+
And we can see how the max of the count(*) column drags along its associated category if we omit the substring bit:
SELECT
tag,
max(concat(lpad(c, 20, '0'), category)) AS xmost_frequent_category
FROM (
SELECT tag, category, count(*) AS c
FROM tags INNER JOIN stuff using (id)
GROUP BY tag, category
) as grouped_cats
GROUP BY tag;
+-------------+---------------------------+
| tag | xmost_frequent_category |
+-------------+---------------------------+
| automotive | 00000000000000000001cat-8 |
| ba | 00000000000000000001cat-8 |
| bamboo | 00000000000000000009cat-8 |
| banana tree | 00000000000000000004cat-8 |
| bath | 00000000000000000001cat-9 |
+-------------+---------------------------+
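To see why the zero padding matters, here is a minimal Python sketch of the same idea applied to the grouped_cats rows shown above. The function name mode_by_padded_max and the PAD constant are made up for this illustration, and Python compares strings by code point, which may not match your MySQL collation exactly.

```python
# Per-tag category counts, copied from the grouped_cats output above.
grouped_cats = [
    ("automotive", "cat-8", 1),
    ("ba", "cat-8", 1),
    ("bamboo", "cat-8", 9),
    ("bamboo", "cat-10", 2),
    ("bamboo", "cat-9", 2),
    ("banana tree", "cat-8", 4),
    ("bath", "cat-9", 1),
]

PAD = 20  # same width as lpad(c, 20, '0') in the query

# Without padding, string comparison gets counts wrong: "9" sorts after "10".
assert "9" > "10"
# With padding, lexical order agrees with numeric order.
assert "9".zfill(PAD) < "10".zfill(PAD)


def mode_by_padded_max(rows):
    """Return {tag: category}, mimicking max(concat(lpad(c, 20, '0'), category))."""
    best = {}
    for tag, category, count in rows:
        key = str(count).zfill(PAD) + category  # e.g. '00000000000000000009cat-8'
        if tag not in best or key > best[tag]:
            best[tag] = key
    # substring(..., 21) is 1-based in SQL, so it maps to slicing from index 20.
    return {tag: key[PAD:] for tag, key in best.items()}


result = mode_by_padded_max(grouped_cats)
print(result)
```

Note that when two categories tie on count, the max picks whichever category string compares greatest, so the tie-break is lexical rather than meaningful.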
Upvotes: 0
Reputation: 952
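This is for simpler situations, where you only need the frequency of each value in a single column rather than a per-group mode. The first row returned is the most frequent value: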
SELECT action, COUNT(action) AS ActionCount
FROM log
GROUP BY action
ORDER BY ActionCount DESC;
Upvotes: 1
Reputation:
(Edit: forgot DESC in ORDER BYs)
Easy to do with a LIMIT in the subquery. Does MySQL still have the no-LIMIT-in-subqueries restriction? The example below uses PostgreSQL.
=> select tag,
     (select category from stuff z where z.tag = s.tag
        group by tag, category order by count(*) DESC limit 1) AS category,
     (select count(*) from stuff z where z.tag = s.tag
        group by tag, category order by count(*) DESC limit 1) AS num_items
   from stuff s
   group by tag;
tag | category | num_items
------------+----------+-----------
ba | 8 | 1
automotive | 8 | 1
bananatree | 8 | 4
bath | 9 | 1
bamboo | 8 | 9
(5 rows)
Third column is only necessary if you need the count.
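If you want to try the correlated-subquery idea without a PostgreSQL server, here is a rough sketch that runs an equivalent query against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in here, so this says nothing about whether MySQL accepts the same statement. The table layout (a stuff table with tag and category columns) is assumed from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (tag TEXT NOT NULL, category INTEGER NOT NULL)")

# Same distribution as the sample data in the question.
rows = (
    [("automotive", 8), ("ba", 8)]
    + [("bamboo", 8)] * 9 + [("bamboo", 10)] * 2 + [("bamboo", 9)] * 2
    + [("bananatree", 8)] * 4
    + [("bath", 9)]
)
conn.executemany("INSERT INTO stuff VALUES (?, ?)", rows)

# For each tag, the scalar subquery counts rows per category for that tag,
# sorts by the count in descending order, and keeps only the top category.
query = """
SELECT s.tag,
       (SELECT z.category
          FROM stuff z
         WHERE z.tag = s.tag
         GROUP BY z.category
         ORDER BY COUNT(*) DESC
         LIMIT 1) AS category
  FROM stuff s
 GROUP BY s.tag
 ORDER BY s.tag
"""
modes = conn.execute(query).fetchall()
print(modes)
```

When two categories tie for the highest count, LIMIT 1 returns whichever one the engine happens to sort first, so add a second ORDER BY key if you need a deterministic answer.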
Upvotes: 3
Reputation: 562791
SELECT t1.*
FROM (SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category) t1
LEFT OUTER JOIN
(SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category) t2
ON (t1.tag = t2.tag AND (t1.count < t2.count
OR t1.count = t2.count AND t1.category < t2.category))
WHERE t2.tag IS NULL
ORDER BY t1.count DESC;
I agree this is kind of too much for a single SQL query. Any use of GROUP BY inside a subquery makes me wince. You can make it look simpler by using views:
CREATE VIEW count_per_category AS
SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category;
SELECT t1.*
FROM count_per_category t1
LEFT OUTER JOIN count_per_category t2
ON (t1.tag = t2.tag AND (t1.count < t2.count
OR t1.count = t2.count AND t1.category < t2.category))
WHERE t2.tag IS NULL
ORDER BY t1.count DESC;
But it's basically doing the same work behind the scenes.
You comment that you could do a similar operation easily in application code. So why don't you do that? Do the simpler query to get the counts per category:
SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category;
And sort through the result in application code.
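As a rough idea of what that application-side step could look like, here is a short Python sketch. It assumes the rows arrive as (tag, category, count) tuples, and the function name most_frequent_per_tag is invented for this example. The counts are derived from the sample data in the question.

```python
def most_frequent_per_tag(rows):
    """Map each tag to its (category, count) pair with the highest count."""
    best = {}
    for tag, category, count in rows:
        current = best.get(tag)
        # Prefer the higher count; on a tie prefer the larger category,
        # which is the same tie-break the self-join above applies.
        if current is None or (count, category) > (current[1], current[0]):
            best[tag] = (category, count)
    return best


# Output of the counts-per-category query for the question's sample data.
counts = [
    ("automotive", 8, 1),
    ("ba", 8, 1),
    ("bamboo", 8, 9),
    ("bamboo", 10, 2),
    ("bamboo", 9, 2),
    ("banana tree", 8, 4),
    ("bath", 9, 1),
]

winners = most_frequent_per_tag(counts)
for tag, (category, count) in sorted(winners.items()):
    print(tag, category, count)
```

This is a single pass over the result set, so it needs no sorting and only keeps one entry per tag in memory.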
Upvotes: 4
Reputation: 425723
SELECT tag, category
FROM (
SELECT @tag <> tag AS _new,
@tag := tag AS tag,
category, COUNT(*) AS cnt
FROM (
SELECT @tag := ''
) vars,
stuff
GROUP BY
tag, category
ORDER BY
tag, cnt DESC
) q
WHERE _new
On your data, this returns the following:
'automotive', 8
'ba', 8
'bamboo', 8
'bananatree', 8
'bath', 9
Here's the test script:
CREATE TABLE stuff (tag VARCHAR(20) NOT NULL, category INT NOT NULL);
INSERT
INTO stuff
VALUES
('automotive',8),
('ba',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',10),
('bamboo',8),
('bamboo',9),
('bamboo',8),
('bamboo',10),
('bamboo',8),
('bamboo',9),
('bamboo',8),
('bananatree',8),
('bananatree',8),
('bananatree',8),
('bananatree',8),
('bath',9);
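The query amounts to "sort by tag and descending count, then keep the first row of each tag". Here is a hedged Python sketch of that same logic using itertools.groupby, with the per-category counts for the test data written out by hand. It only illustrates the idea and is not a replacement for the SQL. As far as I know, MySQL does not guarantee the order in which user variables are evaluated within a statement, so check the query's behaviour on your own server version.

```python
from itertools import groupby
from operator import itemgetter

# (tag, category, cnt) rows, as the inner GROUP BY would produce them
# for the test script above.
counts = [
    ("bath", 9, 1),
    ("bamboo", 10, 2),
    ("bamboo", 8, 9),
    ("bamboo", 9, 2),
    ("bananatree", 8, 4),
    ("automotive", 8, 1),
    ("ba", 8, 1),
]

# Equivalent of ORDER BY tag, cnt DESC.
ordered = sorted(counts, key=lambda row: (row[0], -row[2]))

# Equivalent of WHERE _new: keep only the first row seen for each tag.
first_per_tag = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]

top = [(tag, category) for tag, category, _ in first_per_tag]
print(top)
```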
Upvotes: 4