Reputation: 16771
I have a simple database schema and sample like so:
CREATE TABLE Media (
id INT AUTO_INCREMENT PRIMARY KEY,
file VARCHAR(255)
);
CREATE TABLE Tag (
id INT AUTO_INCREMENT PRIMARY KEY,
label VARCHAR(255)
);
CREATE TABLE Media_Tag (
media_id INT,
tag_id INT,
PRIMARY KEY(media_id, tag_id)
);
INSERT INTO Media VALUES
(1, "firetruck.jpg"),
(2, "duck.jpg"),
(3, "apple.jpg"),
(4, "banana.jpg");
INSERT INTO Tag VALUES
(1, "red"),
(2, "yellow"),
(3, "mobile"),
(4, "immobile");
INSERT INTO Media_Tag VALUES
(1, 1),
(1, 3),
(2, 2),
(2, 3),
(3, 1),
(3, 4),
(4, 2),
(4, 4);
If I want to perform a search for a single tag, that's pretty easy:
SELECT
m.*
FROM
Media m
LEFT JOIN Media_Tag mt ON mt.media_id = m.id
LEFT JOIN Tag t ON mt.tag_id = t.id
WHERE
t.label = ?
However I'm interested in doing a search by two tags. For instance, if a user searched for "red" and "mobile", I want it to only return "firetruck.jpg" and not "apple.jpg" (just red) or "duck.jpg" (just mobile)
So far I've been working on a solution like the following:
SELECT
m.*
FROM
Media m
LEFT JOIN Media_Tag mt1 ON mt1.media_id = m.id
LEFT JOIN Media_Tag mt2 ON mt1.media_id = mt2.media_id AND mt1.tag_id <> mt2.tag_id
LEFT JOIN Tag t1 ON t1.id = mt1.tag_id
LEFT JOIN Tag t2 ON t2.id = mt2.tag_id
WHERE
t1.label = ? AND
t2.label = ?
This works (and is pretty fast) except that I have to add two additional JOIN
clauses for every tag added to the search parameters. If I don't know how many search parameters will be passed going in, I need to create a query with a "maximum" number of allowed search parameters by pre-joining X number of tables.
Is there a better solution?
I've been toying around with an idea like:
SELECT
m.*
FROM
Media m
LEFT JOIN Media_Tag mt ON mt.media_id = m.id
LEFT JOIN Tag t ON mt.tag_id = t.id
WHERE
t.label IN ("red", "mobile")
GROUP BY
<all fields on m>
HAVING
COUNT(*) = <count-of-parameters>
But I ran into two issues when using this in MySQL Workbench on a sample dataset of 500,000 rows:
JOIN
s ran in 0.002 seconds, whereas the solution with GROUP BY
and HAVING
took a whole 3 secondsGROUP BY
solution seemed to be in a random order, whereas the results of the multiple JOIN
solution came back in the primary key order of the Media tableI'm not entirely sure why the solution is so incredibly slow. Maybe there's something I don't understand about how HAVING
clauses work internally. But regardless, the results coming back in a seemingly random order make this an unusable solution because I fear it will break pagination.
I've learned that the solution with multiple JOIN
s running in 0.002 seconds on my 500k data set was a bit of a fluke. The script that I used to add the data added a Media item, then its tags. This meant that all tags for the first 100 media items could be found at the top of the tags table. When I performed my search, I had a LIMIT 0,25
clause to mimic pagination. This was ending my query early when it found 25 media items that matched, all of which could be found at the top of the tags table.
The HAVING
solution, on the other hand, was scanning the whole tags table. This explains the 3 seconds -- that's just how long it takes to scan a table of 1 million rows.
If I modified my search to something that returned fewer than 25 Media items then suddenly it had to scan the whole table and couldn't exit early, and the JOIN
solution took 3 seconds, as well.
I don't think I was clear in my original post, so I want to expand on it. My priority here is efficiency, not data integrity, code simplicity, or normalization. My current database schema is normalized, but I'm willing to de-normalize it if that would allow for a more efficient search.
As an experiment I amended my Media
table with one new field:
UPDATE TABLE Media ADD COLUMN all_tags varchar(255);
UPDATE
Media m
INNER JOIN (
SELECT
m.id,
GROUP_CONCAT(t.label ORDER BY t.label ASC) as all_tags
FROM
Media b
LEFT JOIN Media_Tag mt ON mt.media_id = m.id
LEFT JOIN Tag t ON mt.tag_id = t.id
GROUP BY
m.id
ORDER BY
m.id
) j ON j.id = m.id
SET m.all_tags = j.all_tags;
My new table looks like so:
+----+---------------+-----------------+
| id | file | all_tags |
+----+---------------+-----------------+
| 1 | firetruck.jpg | mobile,red |
| 2 | duck.jpg | mobile,yellow |
| 3 | apple.jpg | immobile,red |
| 4 | banana.jpg | immobile,yellow |
+----+---------------+-----------------+
I can then perform a search for tags like so:
SELECT * FROM Media WHERE all_tags LIKE "%tag1%tag2%...%";
So long as tag1, tag2, etc. are in alphabetical order (just like all_tags
is in alphabetical order) then this will work.
This was able to perform full-table searches (searches which returned fewer than the pagination limit) in about 350 milliseconds on my data set of 500k Media items. Much better, but still not where I want to it be. I'm aiming for a response time under 100 milliseconds if possible.
Just for fun I added an index on the all_tags
column and did an exact match search:
SELECT * FROM Media WHERE all_tags = "mobile,red";
This finished in 0.2 milliseconds. Unfortunately I can't rely on exact matches. Someone searching for the two tags "mobile" and "red" should also turn up a Media item with the three tags "cat", "mobile", and "red" -- and since "cat" comes before "mobile" alphabetically, the only way to ensure this turns up in the result set is with a starting wildcard in my LIKE
clause, which prevents the use of an index.
I've been trying to think of more clever solutions, like having 26 columns for "all_tags_starting_with_A", "all_tags_starting_with_B", etc -- but I can't seem to wrap my head around the best option.
Upvotes: 3
Views: 729
Reputation: 164069
The solution with GROUP BY
is certainly easier to maintain so it's worth trying it but applied only to the join of Media_Tag
and Tag
and joining the results to Media
:
SELECT m.*
FROM Media m
INNER JOIN (
SELECT mt.media_id
FROM Media_Tag mt INNER JOIN Tag t
ON mt.tag_id = t.id
WHERE t.label IN ('red', 'mobile')
GROUP BY mt.media_id
HAVING COUNT(*) = 2
) t ON t.media_id = m.id;
I changed all the joins to INNER
because I don't see the point of LEFT
joins.
Or with the operator IN
instead of a join to Media
:
SELECT m.*
FROM Media m
WHERE m.id IN (
SELECT mt.media_id
FROM Media_Tag mt INNER JOIN Tag t
ON mt.tag_id = t.id
WHERE t.label IN ('red', 'mobile')
GROUP BY mt.media_id
HAVING COUNT(*) = 2
);
Upvotes: 2