stevendesu
stevendesu

Reputation: 16771

Efficient search by multiple tags in MySQL?

I have a simple database schema and sample like so:

CREATE TABLE Media (
    id INT AUTO_INCREMENT PRIMARY KEY,
    file VARCHAR(255)
);

CREATE TABLE Tag (
    id INT AUTO_INCREMENT PRIMARY KEY,
    label VARCHAR(255)
);

CREATE TABLE Media_Tag (
    media_id INT,
    tag_id INT,
    PRIMARY KEY(media_id, tag_id)
);

INSERT INTO Media VALUES
    (1, "firetruck.jpg"),
    (2, "duck.jpg"),
    (3, "apple.jpg"),
    (4, "banana.jpg");

INSERT INTO Tag VALUES
    (1, "red"),
    (2, "yellow"),
    (3, "mobile"),
    (4, "immobile");

INSERT INTO Media_Tag VALUES
    (1, 1),
    (1, 3),
    (2, 2),
    (2, 3),
    (3, 1),
    (3, 4),
    (4, 2),
    (4, 4);

If I want to perform a search for a single tag, that's pretty easy:

SELECT
    m.*
FROM
    Media m
    LEFT JOIN Media_Tag mt ON mt.media_id = m.id
    LEFT JOIN Tag t ON mt.tag_id = t.id
WHERE
    t.label = ?

However I'm interested in doing a search by two tags. For instance, if a user searched for "red" and "mobile", I want it to only return "firetruck.jpg" and not "apple.jpg" (just red) or "duck.jpg" (just mobile)


So far I've been working on a solution like the following:

SELECT
    m.*
FROM
    Media m
    LEFT JOIN Media_Tag mt1 ON mt1.media_id = m.id
    LEFT JOIN Media_Tag mt2 ON mt1.media_id = mt2.media_id AND mt1.tag_id <> mt2.tag_id
    LEFT JOIN Tag t1 ON t1.id = mt1.tag_id
    LEFT JOIN Tag t2 ON t2.id = mt2.tag_id
WHERE
    t1.label = ? AND
    t2.label = ?

This works (and is pretty fast) except that I have to add two additional JOIN clauses for every tag added to the search parameters. If I don't know how many search parameters will be passed going in, I need to create a query with a "maximum" number of allowed search parameters by pre-joining X number of tables.

Is there a better solution?

I've been toying around with an idea like:

SELECT
    m.*
FROM
    Media m
    LEFT JOIN Media_Tag mt ON mt.media_id = m.id
    LEFT JOIN Tag t ON mt.tag_id = t.id
WHERE
    t.label IN ("red", "mobile")
GROUP BY
    <all fields on m>
HAVING
    COUNT(*) = <count-of-parameters>

But I ran into two issues when using this in MySQL Workbench on a sample dataset of 500,000 rows:

  1. The solution with multiple JOINs ran in 0.002 seconds, whereas the solution with GROUP BY and HAVING took a whole 3 seconds
  2. The results of the GROUP BY solution seemed to be in a random order, whereas the results of the multiple JOIN solution came back in the primary key order of the Media table

I'm not entirely sure why the solution is so incredibly slow. Maybe there's something I don't understand about how HAVING clauses work internally. But regardless, the results coming back in a seemingly random order make this an unusable solution because I fear it will break pagination.


Update 1:

I've learned that the solution with multiple JOINs running in 0.002 seconds on my 500k data set was a bit of a fluke. The script that I used to add the data added a Media item, then its tags. This meant that all tags for the first 100 media items could be found at the top of the tags table. When I performed my search, I had a LIMIT 0,25 clause to mimic pagination. This was ending my query early when it found 25 media items that matched, all of which could be found at the top of the tags table.

The HAVING solution, on the other hand, was scanning the whole tags table. This explains the 3 seconds -- that's just how long it takes to scan a table of 1 million rows.

If I modified my search to something that returned fewer than 25 Media items then suddenly it had to scan the whole table and couldn't exit early, and the JOIN solution took 3 seconds, as well.

Update 2:

I don't think I was clear in my original post, so I want to expand on it. My priority here is efficiency, not data integrity, code simplicity, or normalization. My current database schema is normalized, but I'm willing to de-normalize it if that would allow for a more efficient search.

As an experiment I amended my Media table with one new field:

UPDATE TABLE Media ADD COLUMN all_tags varchar(255);

UPDATE
    Media m
    INNER JOIN (
        SELECT
            m.id,
            GROUP_CONCAT(t.label ORDER BY t.label ASC) as all_tags
        FROM
            Media b
            LEFT JOIN Media_Tag mt ON mt.media_id = m.id
            LEFT JOIN Tag t ON mt.tag_id = t.id
        GROUP BY
            m.id
        ORDER BY
            m.id
    ) j ON j.id = m.id
    SET m.all_tags = j.all_tags;

My new table looks like so:

+----+---------------+-----------------+
| id |      file     |     all_tags    |
+----+---------------+-----------------+
|  1 | firetruck.jpg |   mobile,red    |
|  2 |    duck.jpg   |  mobile,yellow  |
|  3 |   apple.jpg   |   immobile,red  |
|  4 |   banana.jpg  | immobile,yellow |
+----+---------------+-----------------+

I can then perform a search for tags like so:

SELECT * FROM Media WHERE all_tags LIKE "%tag1%tag2%...%";

So long as tag1, tag2, etc. are in alphabetical order (just like all_tags is in alphabetical order) then this will work.

This was able to perform full-table searches (searches which returned fewer than the pagination limit) in about 350 milliseconds on my data set of 500k Media items. Much better, but still not where I want to it be. I'm aiming for a response time under 100 milliseconds if possible.

Just for fun I added an index on the all_tags column and did an exact match search:

SELECT * FROM Media WHERE all_tags = "mobile,red";

This finished in 0.2 milliseconds. Unfortunately I can't rely on exact matches. Someone searching for the two tags "mobile" and "red" should also turn up a Media item with the three tags "cat", "mobile", and "red" -- and since "cat" comes before "mobile" alphabetically, the only way to ensure this turns up in the result set is with a starting wildcard in my LIKE clause, which prevents the use of an index.

I've been trying to think of more clever solutions, like having 26 columns for "all_tags_starting_with_A", "all_tags_starting_with_B", etc -- but I can't seem to wrap my head around the best option.

Upvotes: 3

Views: 729

Answers (1)

forpas
forpas

Reputation: 164069

The solution with GROUP BY is certainly easier to maintain so it's worth trying it but applied only to the join of Media_Tag and Tag and joining the results to Media:

SELECT m.*
FROM Media m
INNER JOIN (
  SELECT mt.media_id
  FROM Media_Tag mt INNER JOIN Tag t 
  ON mt.tag_id = t.id
  WHERE t.label IN ('red', 'mobile')
  GROUP BY mt.media_id
  HAVING COUNT(*) = 2
) t ON t.media_id = m.id;

I changed all the joins to INNER because I don't see the point of LEFT joins.
Or with the operator IN instead of a join to Media:

SELECT m.*
FROM Media m
WHERE m.id IN (
  SELECT mt.media_id
  FROM Media_Tag mt INNER JOIN Tag t 
  ON mt.tag_id = t.id
  WHERE t.label IN ('red', 'mobile')
  GROUP BY mt.media_id
  HAVING COUNT(*) = 2
);

Upvotes: 2

Related Questions