Reputation: 12570

Scalable Database Tagging Schema

EDIT: To people building tagging systems: don't read this. It is not what you are looking for. I asked this before I knew that every RDBMS has its own optimization methods. Just use a simple many-to-many schema.

I have a posting system that has millions of posts. Each post can have an unlimited number of tags associated with it.

Users can create tags which have notes, date created, owner, etc. A tag is almost like a post itself, because people can post notes about the tag.

Each tag association has an owner and date, so we can see who added the tag and when.

My question is: how can I implement this? Searching posts by tag, or tags by post, has to be fast. Also, users add tags to posts by typing the name into a field that fills in the rest of the tag name for them, much like the Google search bar.

I have 3 solutions at the moment, but not sure which is the best, or if there is a better way.

Note that I'm not showing the layout of notes since it will be trivial once I get a proper solution for tags.

Method 1. Linked list

tagId in post points to the head of a linked list in tag_assoc; the application must traverse the list until flink = 0.

post:           id, content, ownerId, date, tagId, notesId
tag_assoc:      id, tagId, ownerId, flink
tag:            id, name, notesId

Method 2. Denormalization

tags is simply a VARCHAR or TEXT field containing a tab-delimited array of tagId:ownerId pairs. It cannot be a fixed size.

post:           id, content, ownerId, date, tags, notesId
tag:            id, name, notesId
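
To illustrate the trade-off in Method 2, here is a minimal sketch using Python's sqlite3 module as a stand-in for the real database. The helper names `parse_tags` and `posts_with_tag` and the sample rows are hypothetical. Reading the tags of one post is a single row fetch, but finding posts by tag means reading and parsing every row in the application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, content TEXT, tags TEXT)")
conn.executemany("INSERT INTO post VALUES (?, ?, ?)", [
    (1, "first", "7:1\t12:2"),   # tags 7 and 12
    (2, "second", "12:1"),       # tag 12
    (3, "third", "112:3"),       # tag 112, which must not match a search for 12
])

def parse_tags(field):
    """Split 'tagId:ownerId<TAB>tagId:ownerId' into (tagId, ownerId) pairs."""
    return [tuple(int(part) for part in item.split(":"))
            for item in field.split("\t") if item]

def posts_with_tag(conn, tag_id):
    """No index can help here: every row is read and parsed in the application."""
    rows = conn.execute("SELECT id, tags FROM post ORDER BY id")
    return [post_id for post_id, field in rows
            if any(t == tag_id for t, _owner in parse_tags(field))]

tags_of_post_1 = parse_tags(
    conn.execute("SELECT tags FROM post WHERE id = 1").fetchone()[0])
print(tags_of_post_1)            # [(7, 1), (12, 2)]
print(posts_with_tag(conn, 12))  # [1, 2]
```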

Method 3. Toxi

(from: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html, also same thing here: Recommended SQL database design for tags or tagging)

post:          id, content, ownerId, date, notesId
tag_assoc:     ownerId, tagId, postId
tag:           id, name, notesId

Method 3 raises the question: how fast will it be to iterate through every single row in tag_assoc?
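
As a hedged illustration of that concern: with an index on tag_assoc in each direction, a typical engine can answer both lookups from the index instead of visiting every row. The sketch below uses Python's sqlite3 module in place of the real database. The primary key (postId, tagId), the index name `tag_assoc_by_tag`, and the sample data are assumptions, not part of the original layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tag_assoc (
        ownerId INTEGER NOT NULL,
        tagId   INTEGER NOT NULL REFERENCES tag(id),
        postId  INTEGER NOT NULL REFERENCES post(id),
        PRIMARY KEY (postId, tagId)            -- serves "tags by post"
    );
    -- Assumed second index, serving "posts by tag".
    CREATE INDEX tag_assoc_by_tag ON tag_assoc (tagId, postId);

    INSERT INTO post VALUES (1, 'first'), (2, 'second');
    INSERT INTO tag VALUES (1, 'sql'), (2, 'mysql');
    -- Column order is (ownerId, tagId, postId).
    INSERT INTO tag_assoc VALUES (9, 1, 1), (9, 2, 1), (8, 1, 2);
""")

# Tags by post: look up by the leading column of the primary key.
tags_of_post_1 = [r[0] for r in conn.execute(
    "SELECT t.name FROM tag_assoc a JOIN tag t ON t.id = a.tagId "
    "WHERE a.postId = ? ORDER BY t.name", (1,))]

# Posts by tag: resolve the name via the UNIQUE index, then use tag_assoc_by_tag.
posts_tagged_sql = [r[0] for r in conn.execute(
    "SELECT a.postId FROM tag_assoc a JOIN tag t ON t.id = a.tagId "
    "WHERE t.name = ? ORDER BY a.postId", ("sql",))]

print(tags_of_post_1)    # ['mysql', 'sql']
print(posts_tagged_sql)  # [1, 2]
```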

Methods 1 and 2 should be fast for returning tags by post, but for posts by tag, another lookup table must be made.

The last thing I have to worry about is optimizing tag searches by name. I have not worked that out yet.

I made an ASCII diagram here: http://pastebin.com/f1c4e0e53

Upvotes: 6

Answers (4)

L̲̳o̲̳̳n̲̳̳g̲̳̳p̲̳o̲̳̳k̲̳̳e̲̳̳

Reputation: 12570

Bill, I think I threw you off: the notes are just in another table, and there is a separate table with notes posted by different people. Posts have notes and tags, but tags also have notes, which is why tags are UNIQUE.

Jonathan is right about linked lists; I won't use them at all. I decided to implement the tags in the simplest normalized way that meets my needs:

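-- Drop the child tables first so their foreign keys do not block the drops below.
DROP TABLE IF EXISTS `posts_tags`;
DROP TABLE IF EXISTS `posts_notes`;
DROP TABLE IF EXISTS `tags`;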
CREATE TABLE IF NOT EXISTS `tags` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `owner` int(10) unsigned NOT NULL,
  `date` int(10) unsigned NOT NULL,
  `name` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

DROP TABLE IF EXISTS `posts`;
CREATE TABLE IF NOT EXISTS `posts` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `owner` int(10) unsigned NOT NULL,
  `date` int(10) unsigned NOT NULL,
  `name` varchar(255) NOT NULL,
  `content` TEXT NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

DROP TABLE IF EXISTS `posts_notes`;
CREATE TABLE IF NOT EXISTS `posts_notes` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `owner` int(10) unsigned NOT NULL,
  `date` int(10) unsigned NOT NULL,
  `postId` int(10) unsigned NOT NULL,
  `note` TEXT NOT NULL,
  PRIMARY KEY (`id`),
  FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

DROP TABLE IF EXISTS `posts_tags`;
CREATE TABLE IF NOT EXISTS `posts_tags` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `owner` int(10) unsigned NOT NULL,
  `tagId` int(10) unsigned NOT NULL,
  `postId` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE,
  FOREIGN KEY (`tagId`) REFERENCES tags(`id`) ON DELETE CASCADE
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

I'm not sure how fast this will be in the future, but it should be fine for a while, as only a couple of people use the database.
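
For illustration, here is a rough sketch of how tagging a post by name could work against this schema. It uses Python's sqlite3 module with SQLite translations of the tables above (posts_notes is left out), so the DDL differs from the MySQL version. The function `tag_post` and the sample rows are hypothetical. The UNIQUE key on name is what lets the tag be created on first use and reused afterwards.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY AUTOINCREMENT, owner INTEGER NOT NULL,
                       date INTEGER NOT NULL, name TEXT NOT NULL UNIQUE);
    CREATE TABLE posts (id INTEGER PRIMARY KEY AUTOINCREMENT, owner INTEGER NOT NULL,
                        date INTEGER NOT NULL, name TEXT NOT NULL, content TEXT NOT NULL);
    CREATE TABLE posts_tags (id INTEGER PRIMARY KEY AUTOINCREMENT, owner INTEGER NOT NULL,
                             tagId INTEGER NOT NULL REFERENCES tags(id) ON DELETE CASCADE,
                             postId INTEGER NOT NULL REFERENCES posts(id) ON DELETE CASCADE);
""")

def tag_post(conn, post_id, tag_name, user_id):
    """Attach tag_name to a post, creating the tag on first use."""
    now = int(time.time())
    # INSERT OR IGNORE is SQLite syntax; MySQL spells it INSERT IGNORE.
    conn.execute("INSERT OR IGNORE INTO tags (owner, date, name) VALUES (?, ?, ?)",
                 (user_id, now, tag_name))
    tag_id = conn.execute("SELECT id FROM tags WHERE name = ?", (tag_name,)).fetchone()[0]
    conn.execute("INSERT INTO posts_tags (owner, tagId, postId) VALUES (?, ?, ?)",
                 (user_id, tag_id, post_id))
    return tag_id

now = int(time.time())
conn.execute("INSERT INTO posts (owner, date, name, content) VALUES (1, ?, 'A', 'body')", (now,))
conn.execute("INSERT INTO posts (owner, date, name, content) VALUES (1, ?, 'B', 'body')", (now,))

first = tag_post(conn, 1, "database", 1)   # creates the tag
second = tag_post(conn, 2, "database", 2)  # reuses it
tag_count = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]

# Deleting a post removes its tag links through ON DELETE CASCADE.
conn.execute("DELETE FROM posts WHERE id = 1")
links_left = conn.execute("SELECT COUNT(*) FROM posts_tags").fetchone()[0]
print(first == second, tag_count, links_left)
```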

Upvotes: 0

Jonathan Leffler

Reputation: 754450

A linked list is almost certainly the wrong approach. It means that your queries will be either complex or sub-optimal, which is ironic, since the most likely reason for using a linked list is to keep the data in the correct sorted order. I don't see an easy way to avoid iteratively fetching a row and then using the flink value retrieved to condition the select operation for the next row.
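
To make that cost concrete, here is a minimal sketch of the traversal, using Python's sqlite3 module as a stand-in for the real database. Table and column names follow the question; the function `tags_for_post` and the sample chain are hypothetical. Each node in the chain costs one more round trip.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, content TEXT, tagId INTEGER);
    CREATE TABLE tag_assoc (id INTEGER PRIMARY KEY, tagId INTEGER,
                            ownerId INTEGER, flink INTEGER);
    -- post.tagId holds the id of the first tag_assoc node.
    -- Post 1 heads a chain of three nodes: 10 -> 11 -> 12 -> end (flink = 0).
    INSERT INTO post VALUES (1, 'hello', 10);
    INSERT INTO tag_assoc VALUES (10, 100, 1, 11), (11, 101, 1, 12), (12, 102, 2, 0);
""")

def tags_for_post(conn, post_id):
    """Walk the chain; return (tag ids, number of SELECTs issued)."""
    queries = 1
    row = conn.execute("SELECT tagId FROM post WHERE id = ?", (post_id,)).fetchone()
    node = row[0] if row else 0
    tag_ids = []
    while node != 0:
        # The next SELECT cannot be issued until this one returns its flink.
        queries += 1
        tag_id, node = conn.execute(
            "SELECT tagId, flink FROM tag_assoc WHERE id = ?", (node,)).fetchone()
        tag_ids.append(tag_id)
    return tag_ids, queries

tag_ids, queries = tags_for_post(conn, 1)
print(tag_ids, queries)  # [100, 101, 102] 4
```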

So, use a table-based approach with normal foreign key to primary key references. The one outlined by Bill Karwin looks similar to what I'd outline.

Upvotes: 0

Bill Karwin

Reputation: 562631

Here is how I'd do it:

posts:          [postId], content, ownerId, date, noteId, noteType='post'
tag_assoc:      [postId, tagName], ownerId, date, noteId, noteType='tagAssoc'
tags:           [tagName], ownerId, date, noteId, noteType='tag'
notes:          [noteId, noteType], ownerId, date, content

The fields in square brackets are the primary key of the respective table.

Define a constraint on noteType in each table: posts, tag_assoc, and tags. This prevents a given note from applying to both a post and a tag, for example.
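
A minimal sketch of that constraint, assuming a CHECK on noteType plus a composite foreign key to notes, and using Python's sqlite3 module rather than MySQL. Only the posts table is shown, and the helper `try_insert` is hypothetical. Note that MySQL enforces CHECK constraints only from version 8.0.16 onward, so older versions would need another mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE notes (
        noteId   INTEGER NOT NULL,
        noteType TEXT    NOT NULL,
        content  TEXT,
        PRIMARY KEY (noteId, noteType)
    );
    CREATE TABLE posts (
        postId   INTEGER PRIMARY KEY,
        content  TEXT,
        noteId   INTEGER,
        -- Pinned to 'post', so the foreign key can only match post-type notes.
        noteType TEXT NOT NULL DEFAULT 'post' CHECK (noteType = 'post'),
        FOREIGN KEY (noteId, noteType) REFERENCES notes (noteId, noteType)
    );
    INSERT INTO notes VALUES (1, 'post', 'a note on a post'), (2, 'tag', 'a note on a tag');
""")

# A post may reference note 1, which has noteType 'post'.
conn.execute("INSERT INTO posts (postId, content, noteId) VALUES (1, 'ok', 1)")

def try_insert(sql):
    """Return 'accepted' or 'rejected' depending on the constraints."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

# Note 2 is a tag note: (2, 'post') does not exist in notes, so the FK fails.
wrong_note = try_insert("INSERT INTO posts (postId, content, noteId) VALUES (2, 'bad', 2)")
# Claiming noteType 'tag' on a post violates the CHECK constraint.
wrong_type = try_insert(
    "INSERT INTO posts (postId, content, noteId, noteType) VALUES (3, 'bad', 2, 'tag')")
print(wrong_note, wrong_type)  # rejected rejected
```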

Store tag names as a short string, not an integer id. That way you can use the covering index [postId, tagName] in the tag_assoc table.
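
As a rough check of the covering-index idea, the sketch below builds tag_assoc with the primary key [postId, tagName] in SQLite (via Python's sqlite3 module) and asks the planner how it would fetch the tag names of one post. This is an assumption-laden stand-in: MySQL reports the same situation as "Using index" in the Extra column of EXPLAIN, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag_assoc (
        postId  INTEGER NOT NULL,
        tagName TEXT    NOT NULL,
        ownerId INTEGER NOT NULL,
        date    INTEGER NOT NULL,
        PRIMARY KEY (postId, tagName)
    );
    INSERT INTO tag_assoc VALUES (1, 'mysql', 9, 0), (1, 'sql', 9, 0), (2, 'sql', 8, 0);
""")

# The tag names come straight out of the primary-key index; no join to tags is needed.
names = [r[0] for r in conn.execute(
    "SELECT tagName FROM tag_assoc WHERE postId = ? ORDER BY tagName", (1,))]

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[-1] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT tagName FROM tag_assoc WHERE postId = ?", (1,)))

print(names)
print(plan)
```

The reverse lookup (posts by tag name) is not served by this index; it would need a second index led by tagName.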

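Tag completion is done with an AJAX call. If the user types "datab" for a tag, your web page makes an AJAX call, and on the server side the app queries: SELECT tagName FROM tags WHERE tagName LIKE ?||'%'. (In MySQL's default SQL mode, || is logical OR rather than string concatenation, so write LIKE CONCAT(?, '%') there.)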
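
The server side of that completion call could look roughly like the sketch below, written against Python's sqlite3 module. The function `complete_tag`, its `limit` parameter, and the sample tags are hypothetical. It appends the % in application code and escapes LIKE wildcards in the user's input, which the one-line query above does not do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tagName TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO tags VALUES (?)",
                 [("database",), ("database-design",), ("data%",), ("java",)])

def complete_tag(conn, prefix, limit=10):
    """Return up to `limit` tag names that start with `prefix`."""
    # Escape LIKE wildcards so user input such as '%' is matched literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT tagName FROM tags WHERE tagName LIKE ? ESCAPE '\\' "
        "ORDER BY tagName LIMIT ?",
        (escaped + "%", limit))
    return [r[0] for r in rows]

print(complete_tag(conn, "datab"))  # ['database', 'database-design']
print(complete_tag(conn, "data%"))  # ['data%']
```

Because the pattern has a fixed prefix and a trailing wildcard only, an index on tagName can usually serve it, subject to the engine's collation rules.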

Upvotes: 2

duffymo

Reputation: 308918

"A tag is almost like a post itself, because people can post notes about the tag." - this phrase makes me think you really just want one table for POST, with a primary key and a foreign key that references the POST table. Now you can have as many tags for each post as your disk space will allow.

I'm assuming there's no need for many to many between POST and tags, because a tag isn't shared across posts, based on this:

"Users can create tags which have notes, date created, owner, etc."

If creation date and owner are shared, those would be two additional foreign key relationships, IMO.

Upvotes: 0
