Martin AJ
Martin AJ

Reputation: 6707

Best database design to link related tags

Here is my current table structure for tags:

// tags
+----+------------+----------------------------------+----------+------------+
| id |    name    |            description           | used_num | date_time  |
+----+------------+----------------------------------+----------+------------+
| 1  | PHP        | some explanations for PHP        | 4234     | 1475028896 |
| 2  | SQL        | some explanations for SQL        | 734      | 1475048601 |
| 3  | jQuery     | some explanations for jQuery     | 434      | 1475068321 | 
| 4  | MySQL      | some explanations for MySQL      | 535      | 1475068332 |
| 5  | JavaScript | some explanations for JavaScript | 3325     | 1475071430 |
| 6  | HTML       | some explanations for HTML       | 2133     | 1475077842 |
| 7  | postgresql | some explanations for postgresql | 43       | 1475077851 |
+----+------------+----------------------------------+----------+------------+

As you know, some tags are related to each other. For example:

are related ones in table above. How can I make that relation between them? Should I add one more column? Should it be containing what thing? (since sometimes there are more than 2 related tags)

Upvotes: 0

Views: 129

Answers (2)

Antonín Lejsek
Antonín Lejsek

Reputation: 6103

For storing M:N relation in relation database You typically need another table. Table with two columns id1, id2, foreign keys to related table(s). (id1,id2) should be unique.

If You relate table to itself, there are more possibilities. If row A is related to B, is B related to A? If it is the case it is convenient to add constraint id1 < id2 on relation table so there is just one way to store the relation between two rows. Even if some insert does not play by the rules and tries to insert different order, data are safe from duplicities.

The most interesting question is transitivity. If A is related to B and B is related to C, is A related to C? I think it should be in this case. You do not have to store the relation A to C, as it can be infered, but calculating this every time You want to query it is somewhat impractical. I think the best way is to store all infered relations on update, but flag them as infered, so they can be recalculated on future changes. It makes adding and deleting of relations reversible. If You do not need that (relations do not get deleted), You can omit the flag or just use alternative approach:

What we have is in fact equivalence (reflexive, symmetric, transitive) and equivalence produces equvalence classes. You can store number in each row that would mark the equivalence class. Rows with the same number would be equal. This is what Shafizadeh proposed and I disagree with @Drew that it is the last thing to do. It can be good if used in the right case.

Upvotes: 0

Drew
Drew

Reputation: 24970

For Option 1 at least, please see Edit 1 at bottom for the actual INSERT strategy for the intersect.

Option 1

create table tags
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    UNIQUE KEY `key_tags_name` (name)
    -- All the other columns
)ENGINE=InnoDB;

create table tagIntersects
(   id1 INT NOT NULL,
    id2 INT NOT NULL,
    PRIMARY KEY(id1,id2),
    KEY `ti_flipped` (id2,id1), -- flipped left-mode (thin size)
    FOREIGN KEY `fk_ti_id1` (id1) REFERENCES tags(id),
    FOREIGN KEY `fk_ti_id2` (id2) REFERENCES tags(id)
)ENGINE=InnoDB;

Option 2

create table tags
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    UNIQUE KEY `key_tags_name` (name)
    -- All the other columns
)ENGINE=InnoDB;

create table tagIntersects
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name1 VARCHAR(60) NOT NULL,
    name2 VARCHAR(60) NOT NULL,
    KEY `ti_Flipped` (name2,name1), -- these get costly (wide)
    FOREIGN KEY `fk_ti_id1` (name1) REFERENCES tags(name),
    FOREIGN KEY `fk_ti_id2` (name2) REFERENCES tags(name)
)ENGINE=InnoDB;

Option 3

create table tags
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    UNIQUE KEY `key_tags_name` (name)
    -- All the other columns
)ENGINE=InnoDB;

create table tagIntersects
(   name1 VARCHAR(60) NOT NULL,
    name2 VARCHAR(60) NOT NULL,
    PRIMARY KEY (name1,name2),
    KEY `ti_Flipped` (name2,name1), -- these get costly (wide)
    FOREIGN KEY `fk_ti_id1` (name1) REFERENCES tags(name),
    FOREIGN KEY `fk_ti_id2` (name2) REFERENCES tags(name)
)ENGINE=InnoDB;

.

INSERT tags (name) VALUES ('PHP'),('PDO'),('MYSQLI'),('PHPMyAdmin');

I recommend Option 1. Below is all about Option 1

Fake load 200 tags with random names:

DROP PROCEDURE IF EXISTS tagDataLoad;
DELIMITER $$
CREATE PROCEDURE tagDataLoad()
BEGIN
    -- warning this is horribly slow
    DECLARE i INT DEFAULT 0;
    WHILE i<200 DO
        INSERT IGNORE tags(name)
        select concat(substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
              substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1)
             );
        -- poach the above from Gordon, link: https://stackoverflow.com/a/16738136
        SET i=i+1;
    END WHILE;
END$$
DELIMITER ;

call it:

CALL tagDataLoad(); -- load 200 fake tags
select * from tags; -- eyeball it
CALL tagDataLoad(); -- load more
CALL tagDataLoad(); -- load more
SELECT MIN(id),MAX(id),count(*) FROM tags;
-- 1 604 604

Fake load qty iCount number of fake tag intersects:

DROP PROCEDURE IF EXISTS tagIntersectDataLoad;
DELIMITER $$
CREATE PROCEDURE tagIntersectDataLoad(iCount INT)
BEGIN
    -- warning this is horribly slow
    -- don't pass a number greater than 100 until you time it
    DECLARE i INT DEFAULT 0;
    WHILE i<=iCount DO
        INSERT IGNORE tagIntersects(id1,id2)
        SELECT FLOOR(RAND()*600)+1,FLOOR(RAND()*600)+1;
        SET i=i+1;
    END WHILE;
END$$
DELIMITER ;

CALL tagIntersectDataLoad(100);
-- slow, I don't recommend a number above 100 until you time it 

After changing some 100's to larger counts I ended up with 10k row

select count(*) from tag Intersects;
-- 9900

I don't recommend you do that due to timeouts. But in the end I had the above

Half the reason for the fake load stored procs above are for just getting the table size high enough that indexes are even used. They aren't used for small tables. Also they give you a method to chg to other schemas and load data special for them. And then to profile the performance with your queries with 100k rows or tens of millions (depending on your needs).

See EXPLAIN plan:

explain 
select * from tagIntersects where id1=111 or id2=111;
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| id | select_type | table         | type        | possible_keys      | key                | key_len | ref  | rows | Extra                                        |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | tagIntersects | index_merge | PRIMARY,ti_flipped | PRIMARY,ti_flipped | 4,4     | NULL |   27 | Using union(PRIMARY,ti_flipped); Using where |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+

explain

select * from tagIntersects where (id1=111 or id2=111) and (id1=500 or id2=500);
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| id | select_type | table         | type        | possible_keys      | key                | key_len | ref  | rows | Extra                                        |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | tagIntersects | index_merge | PRIMARY,ti_flipped | PRIMARY,ti_flipped | 4,4     | NULL |   21 | Using union(PRIMARY,ti_flipped); Using where |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+

The EXPLAIN plan looks good above for typical queries. Note the thin key sizes (8 bytes total). The plan shows an index_merge / UNION of two-left most key usages: one with the PK and one with the flipped secondary index. That is the point of ti_flipped.

Also note that the FK keysizes are thin.

Note that tags.name can be readily updated from 'node.js' to 'nodejs' with no impact to the tags Primary Key. And that update would have zero impact on tagsInserted columns or keys.

Concerning using Option 2 or 3: the keys are wide. Changes to tags.name would have PK and FK changes as impacts not endured by Option 1 in the intersect table. Depending on your data size (say, something different thing than SO tags), with tens of millions of rows and thousands of intersects for one name, that change could be felt in the UX. For small to medium size, not much to worry about but Profile the impact.

So generally I opts for an Option 1 approach due to the enormity of my data sets and seeking to keep keys for relationships thin.

Spencer7593 mentioned in a comment recently that large varchars have negative impact on internal memory during joins and with an impact of subqueries / derived's that manifest in temporary tables. Another reason for thin FK's.

So this has as much to do with the readers in general who have different schemas and think there is no impact on performance with them.

So profile the performance of your queries before you finalize on a schema (of huge tables especially).

Edit 1

create table tags
(   id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    UNIQUE KEY `key_tags_name` (name) -- 
    -- All the other columns
)ENGINE=InnoDB;

create table tagIntersects
(   id1 INT NOT NULL,
    id2 INT NOT NULL,
    PRIMARY KEY(id1,id2),
    KEY `ti_flipped` (id2,id1),
    FOREIGN KEY `fk_ti_id1` (id1) REFERENCES tags(id),
    FOREIGN KEY `fk_ti_id2` (id2) REFERENCES tags(id)
)ENGINE=InnoDB;

load:

insert tags(name) values ('tag1'),('tag2'); -- ids 1 and 2

Now I have two id's in some programming language I want to intersect the tags.

Using MySQL as a programming language, let's just call them the following vars:

set @myId1=2; -- actually run this so it is not NULL
set @myId2=1; -- actually run this so it is not NULL

Note that they could be reversed, it does not matter. Now assuming you did not screw up on the programming language such that @myId1 = @myId2 (note the below will still work, but just sayin')

insert tagIntersects(id1,id2) select LEAST(@myId1,@myId2),GREATEST(@myId1,@myId2); -- ok
insert tagIntersects(id1,id2) select LEAST(@myId1,@myId2),GREATEST(@myId1,@myId2); -- GOOD it failed

-- flip em:

set @myId1=1; -- actually run this so it is not NULL
set @myId2=2; -- actually run this so it is not NULL

insert tagIntersects(id1,id2) select LEAST(@myId1,@myId2),GREATEST(@myId1,@myId2); -- GOOD it failed

Your data stays clean. Clean means you would not have two rows in there that dupe up and pollute your data such as an intersect row for MYSQL / SQL-SERVER ... and another row for SQL-SERVER / MYSQL in id1, id2 respectfully.

Edit 2

Question from user Shafizadeh : Ok, you have three tags, tag1, tag2, tag3 .. they are related. So there is three rows in the tagIntersects table like these: tag1,tag2, tag1,tag3, tag2,tag3. Right? Now I want to select all related tags with tag1. Write the query ... :-) seems like a nightmare, huh?

Answer:

explain select * from tagIntersects where id1=2 or id2=2; 
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| id | select_type | table         | type        | possible_keys      | key                | key_len | ref  | rows | Extra                                        |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | tagIntersects | index_merge | PRIMARY,ti_flipped | PRIMARY,ti_flipped | 4,4     | NULL |   37 | Using union(PRIMARY,ti_flipped); Using where |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+

My question back to you is, with your CSV, what is your explain plan? It would look awful and a tablescan.

Upvotes: 2

Related Questions