Reputation: 19793
Consider these 3 table structures. Which will perform these queries the best.
Article
-------
ArticleID int
Article_Tag
------------
ArticleTagID int
ArticleID int
TagID int
Tag
---
TagID int
TagText varchar(50)
Article
-------
articleID int
Article_Tag
-----------
articleTagID int
articleID int
tag varchar(50)
Article
-------
ArticleID int
Article_Tag
------------
ArticleTagID int
ArticleID int
Tag varchar(50)
Tag
---
Tag varchar(50)
Sample queries:
Select articleID from Article a inner join Article_tag at on a.articleID = at.articleID and tag = 'apple'
Select tag from Tags -- or, for structure 2
Select distinct tag from article_tag
Upvotes: 4
Views: 2329
Reputation: 4131
A table with an AUTO_INCREMENT PK won't scale. Forget about TagID as INTEGER, and replace its type as BINARY(16), just enough for the MD5 checksum of TagText.
And with a proper cache layer, your SQL query won't need the TagText column as much as it needed.
Upvotes: 0
Reputation: 3488
I'd go for Structure 2, maybe calling Article_Tag table simply - Tags.
Upvotes: 0
Reputation: 1868
You should be mapping the TagText
to TagId
in code (and mapping to an in-memory cache anyway) and passing the pre-mapped TagId
into your query.
Also there's no reason you need a synthetic key for the Article_Tag
table. You should be using a composite-primary-key (ArticleId
, TagId
).
So, I say #1 with minor tweak mentioned above.
Upvotes: 0
Reputation: 34711
I'd go with 1 every time. It's fully normalised and since you're using a synthetic PK, you can change the name of a tag with a single row update.
The only advantage otherwise is reducing the number of joins. This is an optimisation, which we all know you should only make after measurement. If you were sure structure 1 wasn't fast enough you wouldn't be asking, right?
Now there isn't a whole lot of difference between 2 and 3, but as Bill Karwin notes, 3 has advantages in terms of cascading updates. More to the point the extra table loses you nothing.
So I would say go with 1. If there is a measurable (ie Provable) performance problem then 3 would be perfectly acceptable. It wouldn't be very hard to migrate later anyway.
Upvotes: 2
Reputation: 180914
Welcome back, modesty. Or Localghost, or Shawn or however you call yourself now. Just keep in mind that there is no Hackers badge anymore, so nothing to win here :)
Upvotes: 0
Reputation: 1062600
It depends if you ever want to be able to change the tag-text globally. You could, of course, issue a wide UPDATE
on Article_Tag
, but if you need to be able to do this, then being able to just update the value in Tag
would be simpler. Some servers offer automatic updates (such as ON UPDATE CASCADE
in SQL Server), but these aren't necessarily cheap (it still has to UPDATE
a lot of rows and any indexes defined).
But if you don't need this, then it should be a bit quicker with the literal in Article_Tag
, as it can remove a join - a lot of times. Obviously, index it etc.
The additional space required for the repeated literal is a factor, but disk space is usually cheaper than a faster server.
As for being a primary key; unless you have other data to store, why would you even need the table any more? You could use DISTINCT
on Article_Tag
just as easily, especially as Tag
should be indexed (so this should be pretty cheap). (edit Bill Karwin correctly points out the merits of being able to keep eligible tags, not just current tags).
Upvotes: 5
Reputation: 562260
Using TagText
as the primary key would have the advantage that you could get an articles tags with fewer joins:
SELECT * FROM Article_Tag WHERE Article_ID = ?
It would have the disadvantage that tag strings take more space than integers, so the storage for Article_Tag
and its indexes would be larger. This occupies more space on disk and also requires more memory for caching the index.
Upvotes: 4