Reputation: 6707
Here is my current table structure for tags:
// tags
+----+------------+----------------------------------+----------+------------+
| id | name | description | used_num | date_time |
+----+------------+----------------------------------+----------+------------+
| 1 | PHP | some explanations for PHP | 4234 | 1475028896 |
| 2 | SQL | some explanations for SQL | 734 | 1475048601 |
| 3 | jQuery | some explanations for jQuery | 434 | 1475068321 |
| 4 | MySQL | some explanations for MySQL | 535 | 1475068332 |
| 5 | JavaScript | some explanations for JavaScript | 3325 | 1475071430 |
| 6 | HTML | some explanations for HTML | 2133 | 1475077842 |
| 7 | postgresql | some explanations for postgresql | 43 | 1475077851 |
+----+------------+----------------------------------+----------+------------+
As you know, some tags are related to each other. For example:
SQL
, MySQL
, postgresql
JavaScript
, jQuery
are related ones in table above. How can I make that relation between them? Should I add one more column? Should it be containing what thing? (since sometimes there are more than 2 related tags)
Upvotes: 0
Views: 129
Reputation: 6103
For storing M:N relation in relation database You typically need another table. Table with two columns id1, id2, foreign keys to related table(s). (id1,id2) should be unique.
If You relate table to itself, there are more possibilities. If row A is related to B, is B related to A? If it is the case it is convenient to add constraint id1 < id2 on relation table so there is just one way to store the relation between two rows. Even if some insert does not play by the rules and tries to insert different order, data are safe from duplicities.
The most interesting question is transitivity. If A is related to B and B is related to C, is A related to C? I think it should be in this case. You do not have to store the relation A to C, as it can be infered, but calculating this every time You want to query it is somewhat impractical. I think the best way is to store all infered relations on update, but flag them as infered, so they can be recalculated on future changes. It makes adding and deleting of relations reversible. If You do not need that (relations do not get deleted), You can omit the flag or just use alternative approach:
What we have is in fact equivalence (reflexive, symmetric, transitive) and equivalence produces equvalence classes. You can store number in each row that would mark the equivalence class. Rows with the same number would be equal. This is what Shafizadeh proposed and I disagree with @Drew that it is the last thing to do. It can be good if used in the right case.
Upvotes: 0
Reputation: 24970
For Option 1 at least, please see Edit 1
at bottom for the actual INSERT
strategy for the intersect.
Option 1
create table tags
( id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(60) NOT NULL,
UNIQUE KEY `key_tags_name` (name)
-- All the other columns
)ENGINE=InnoDB;
create table tagIntersects
( id1 INT NOT NULL,
id2 INT NOT NULL,
PRIMARY KEY(id1,id2),
KEY `ti_flipped` (id2,id1), -- flipped left-mode (thin size)
FOREIGN KEY `fk_ti_id1` (id1) REFERENCES tags(id),
FOREIGN KEY `fk_ti_id2` (id2) REFERENCES tags(id)
)ENGINE=InnoDB;
Option 2
create table tags
( id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(60) NOT NULL,
UNIQUE KEY `key_tags_name` (name)
-- All the other columns
)ENGINE=InnoDB;
create table tagIntersects
( id INT AUTO_INCREMENT PRIMARY KEY,
name1 VARCHAR(60) NOT NULL,
name2 VARCHAR(60) NOT NULL,
KEY `ti_Flipped` (name2,name1), -- these get costly (wide)
FOREIGN KEY `fk_ti_id1` (name1) REFERENCES tags(name),
FOREIGN KEY `fk_ti_id2` (name2) REFERENCES tags(name)
)ENGINE=InnoDB;
Option 3
create table tags
( id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(60) NOT NULL,
UNIQUE KEY `key_tags_name` (name)
-- All the other columns
)ENGINE=InnoDB;
create table tagIntersects
( name1 VARCHAR(60) NOT NULL,
name2 VARCHAR(60) NOT NULL,
PRIMARY KEY (name1,name2),
KEY `ti_Flipped` (name2,name1), -- these get costly (wide)
FOREIGN KEY `fk_ti_id1` (name1) REFERENCES tags(name),
FOREIGN KEY `fk_ti_id2` (name2) REFERENCES tags(name)
)ENGINE=InnoDB;
.
INSERT tags (name) VALUES ('PHP'),('PDO'),('MYSQLI'),('PHPMyAdmin');
Fake load 200 tags with random names:
DROP PROCEDURE IF EXISTS tagDataLoad;
DELIMITER $$
CREATE PROCEDURE tagDataLoad()
BEGIN
-- warning this is horribly slow
DECLARE i INT DEFAULT 0;
WHILE i<200 DO
INSERT IGNORE tags(name)
select concat(substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1),
substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', rand()*36+1, 1)
);
-- poach the above from Gordon, link: https://stackoverflow.com/a/16738136
SET i=i+1;
END WHILE;
END$$
DELIMITER ;
call it:
CALL tagDataLoad(); -- load 200 fake tags
select * from tags; -- eyeball it
CALL tagDataLoad(); -- load more
CALL tagDataLoad(); -- load more
SELECT MIN(id),MAX(id),count(*) FROM tags;
-- 1 604 604
Fake load qty iCount
number of fake tag intersects:
DROP PROCEDURE IF EXISTS tagIntersectDataLoad;
DELIMITER $$
CREATE PROCEDURE tagIntersectDataLoad(iCount INT)
BEGIN
-- warning this is horribly slow
-- don't pass a number greater than 100 until you time it
DECLARE i INT DEFAULT 0;
WHILE i<=iCount DO
INSERT IGNORE tagIntersects(id1,id2)
SELECT FLOOR(RAND()*600)+1,FLOOR(RAND()*600)+1;
SET i=i+1;
END WHILE;
END$$
DELIMITER ;
CALL tagIntersectDataLoad(100);
-- slow, I don't recommend a number above 100 until you time it
After changing some 100's to larger counts I ended up with 10k row
select count(*) from tag Intersects;
-- 9900
I don't recommend you do that due to timeouts. But in the end I had the above
Half the reason for the fake load stored procs above are for just getting the table size high enough that indexes are even used. They aren't used for small tables. Also they give you a method to chg to other schemas and load data special for them. And then to profile the performance with your queries with 100k rows or tens of millions (depending on your needs).
See EXPLAIN
plan:
explain
select * from tagIntersects where id1=111 or id2=111;
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | tagIntersects | index_merge | PRIMARY,ti_flipped | PRIMARY,ti_flipped | 4,4 | NULL | 27 | Using union(PRIMARY,ti_flipped); Using where |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
explain
select * from tagIntersects where (id1=111 or id2=111) and (id1=500 or id2=500);
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | tagIntersects | index_merge | PRIMARY,ti_flipped | PRIMARY,ti_flipped | 4,4 | NULL | 21 | Using union(PRIMARY,ti_flipped); Using where |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
The EXPLAIN
plan looks good above for typical queries. Note the thin key sizes (8 bytes total). The plan shows an index_merge
/ UNION
of two-left most key usages: one with the PK and one with the flipped secondary index. That is the point of ti_flipped
.
Also note that the FK keysizes are thin.
Note that tags.name
can be readily updated from 'node.js' to 'nodejs' with no impact to the tags Primary Key. And that update would have zero impact on tagsInserted
columns or keys.
Concerning using Option 2 or 3: the keys are wide. Changes to tags.name
would have PK and FK changes as impacts not endured by Option 1 in the intersect table. Depending on your data size (say, something different thing than SO tags), with tens of millions of rows and thousands of intersects for one name
, that change could be felt in the UX. For small to medium size, not much to worry about but Profile the impact.
So generally I opts for an Option 1 approach due to the enormity of my data sets and seeking to keep keys for relationships thin.
Spencer7593 mentioned in a comment recently that large varchars have negative impact on internal memory during joins and with an impact of subqueries / derived's that manifest in temporary tables. Another reason for thin FK's.
So this has as much to do with the readers in general who have different schemas and think there is no impact on performance with them.
So profile the performance of your queries before you finalize on a schema (of huge tables especially).
create table tags
( id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(60) NOT NULL,
UNIQUE KEY `key_tags_name` (name) --
-- All the other columns
)ENGINE=InnoDB;
create table tagIntersects
( id1 INT NOT NULL,
id2 INT NOT NULL,
PRIMARY KEY(id1,id2),
KEY `ti_flipped` (id2,id1),
FOREIGN KEY `fk_ti_id1` (id1) REFERENCES tags(id),
FOREIGN KEY `fk_ti_id2` (id2) REFERENCES tags(id)
)ENGINE=InnoDB;
load:
insert tags(name) values ('tag1'),('tag2'); -- ids 1 and 2
Now I have two id's in some programming language I want to intersect the tags.
Using MySQL as a programming language, let's just call them the following vars:
set @myId1=2; -- actually run this so it is not NULL
set @myId2=1; -- actually run this so it is not NULL
Note that they could be reversed, it does not matter. Now assuming you did not screw up on the programming language such that @myId1 = @myId2 (note the below will still work, but just sayin')
insert tagIntersects(id1,id2) select LEAST(@myId1,@myId2),GREATEST(@myId1,@myId2); -- ok
insert tagIntersects(id1,id2) select LEAST(@myId1,@myId2),GREATEST(@myId1,@myId2); -- GOOD it failed
-- flip em:
set @myId1=1; -- actually run this so it is not NULL
set @myId2=2; -- actually run this so it is not NULL
insert tagIntersects(id1,id2) select LEAST(@myId1,@myId2),GREATEST(@myId1,@myId2); -- GOOD it failed
Your data stays clean. Clean means you would not have two rows in there that dupe up and pollute your data such as an intersect row for MYSQL
/ SQL-SERVER
... and another row for SQL-SERVER
/ MYSQL
in id1
, id2
respectfully.
Question from user Shafizadeh
: Ok, you have three tags, tag1, tag2, tag3 .. they are related. So there is three rows in the tagIntersects table like these: tag1,tag2, tag1,tag3, tag2,tag3. Right? Now I want to select all related tags with tag1. Write the query ... :-) seems like a nightmare, huh?
Answer:
explain select * from tagIntersects where id1=2 or id2=2;
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | tagIntersects | index_merge | PRIMARY,ti_flipped | PRIMARY,ti_flipped | 4,4 | NULL | 37 | Using union(PRIMARY,ti_flipped); Using where |
+----+-------------+---------------+-------------+--------------------+--------------------+---------+------+------+----------------------------------------------+
My question back to you is, with your CSV, what is your explain plan? It would look awful and a tablescan.
Upvotes: 2