Shawn
Shawn

Reputation: 19793

Optimized Table structure for Tags table

Consider these 3 table structures. Which will perform these queries the best.

Structure 1 - TagID as int with a join table

Article
-------
ArticleID int

Article_Tag
------------
ArticleTagID int
ArticleID int
TagID int

Tag
---
TagID int
TagText varchar(50)

Structure 2 - Tags only in Join table as string

Article
-------
articleID int

Article_Tag
-----------
articleTagID int
articleID int
tag varchar(50)

Structure 3 - Tag as text as the PK

Article
-------
ArticleID int

Article_Tag
------------
ArticleTagID int
ArticleID int
Tag varchar(50)

Tag
---
Tag varchar(50)

Sample queries:

Select articleID from Article a inner join Article_tag at on a.articleID = at.articleID and tag = 'apple'
Select tag from Tags -- or, for structure 2
Select distinct tag from article_tag

Upvotes: 4

Views: 2329

Answers (7)

yogman
yogman

Reputation: 4131

A table with an AUTO_INCREMENT PK won't scale. Forget about TagID as INTEGER, and replace its type as BINARY(16), just enough for the MD5 checksum of TagText.

And with a proper cache layer, your SQL query won't need the TagText column as much as it needed.

Upvotes: 0

dmajkic
dmajkic

Reputation: 3488

I'd go for Structure 2, maybe calling Article_Tag table simply - Tags.

Upvotes: 0

IDisposable
IDisposable

Reputation: 1868

You should be mapping the TagText to TagId in code (and mapping to an in-memory cache anyway) and passing the pre-mapped TagId into your query.

Also there's no reason you need a synthetic key for the Article_Tag table. You should be using a composite-primary-key (ArticleId, TagId).

So, I say #1 with minor tweak mentioned above.

Upvotes: 0

Draemon
Draemon

Reputation: 34711

I'd go with 1 every time. It's fully normalised and since you're using a synthetic PK, you can change the name of a tag with a single row update.

The only advantage otherwise is reducing the number of joins. This is an optimisation, which we all know you should only make after measurement. If you were sure structure 1 wasn't fast enough you wouldn't be asking, right?

Now there isn't a whole lot of difference between 2 and 3, but as Bill Karwin notes, 3 has advantages in terms of cascading updates. More to the point the extra table loses you nothing.

So I would say go with 1. If there is a measurable (ie Provable) performance problem then 3 would be perfectly acceptable. It wouldn't be very hard to migrate later anyway.

Upvotes: 2

Michael Stum
Michael Stum

Reputation: 180914

Welcome back, modesty. Or Localghost, or Shawn or however you call yourself now. Just keep in mind that there is no Hackers badge anymore, so nothing to win here :)

Upvotes: 0

Marc Gravell
Marc Gravell

Reputation: 1062600

It depends if you ever want to be able to change the tag-text globally. You could, of course, issue a wide UPDATE on Article_Tag, but if you need to be able to do this, then being able to just update the value in Tag would be simpler. Some servers offer automatic updates (such as ON UPDATE CASCADE in SQL Server), but these aren't necessarily cheap (it still has to UPDATE a lot of rows and any indexes defined).

But if you don't need this, then it should be a bit quicker with the literal in Article_Tag, as it can remove a join - a lot of times. Obviously, index it etc.

The additional space required for the repeated literal is a factor, but disk space is usually cheaper than a faster server.

As for being a primary key; unless you have other data to store, why would you even need the table any more? You could use DISTINCT on Article_Tag just as easily, especially as Tag should be indexed (so this should be pretty cheap). (edit Bill Karwin correctly points out the merits of being able to keep eligible tags, not just current tags).

Upvotes: 5

Bill Karwin
Bill Karwin

Reputation: 562260

Using TagText as the primary key would have the advantage that you could get an articles tags with fewer joins:

SELECT * FROM Article_Tag WHERE Article_ID = ?

It would have the disadvantage that tag strings take more space than integers, so the storage for Article_Tag and its indexes would be larger. This occupies more space on disk and also requires more memory for caching the index.

Upvotes: 4

Related Questions