dojogeorge
dojogeorge

Reputation: 1704

Database of users with tags

Im fairly new to database design and im trying to work out how the following schema will work.

Id like to have a database of users, which ideally would be in one table. Then for each user I need to store a weighted list of tags, that can be extended later on. For example:

User1: dnb (+3), dubstep (+1), classical (-2), rnb (+1)

User2: rnb (+2), hiphop (+4), jazz (-3), classical (-1)

Where the bracketed number is the 'weight' of tag for each user. Id like to be able to add more tags to each user at any time as well

Upvotes: 1

Views: 139

Answers (1)

Daniel
Daniel

Reputation: 1072

This is fairly simple. You want to represent information about users and tags, with each user having one or more tags and with each tag having a weight. Given that information you have a many-to-many cardinality and you could represent it with three tables:

Users(UserID, Name, Surname, DateOfBirth);
Tags(TagID, TagName);
UsersAndTags(UserID, TagID, TagWeight);

If you're confused about why this is a many-to-many relationships:

  • The user can have one or more tags associated to himself/herself.
  • A tag can be associated with many users.

The diagram would look something like this:

  • User * ---------------- * Tag

I'll provide a more detailed explanation below.


You can represent everything in a relation such as:

+--------+------+---------+-------------+-----+-----------+
| UserID | Name | Surname | DateOfBirth | Tag | TagWeight |
+--------+------+---------+-------------+-----+-----------+

But this is a bad relation because you are representing heterogeneous information inside a single table and doing so you will end up with inconsistencies. Consider the following (with primary key UserID and Tag):

+--------+------+--------------------+-------------+--------+-----------+
| UserID | Name |      Surname       | DateOfBirth |  Tag   | TagWeight |
+--------+------+--------------------+-------------+--------+-----------+
|   1230 | Ana  | Patson             | 12/01/1980  | music  | -1        |
|   2300 | Mike | Johnson            | 01/03/1979  | art    | +3        |
|   2300 | Mike | Johnson            | 01/03/1979  | sports | +1        |
|   2300 | Mike | Johnson            | 01/03/1979  | hiphop | -4        |
|   2300 | Mike | Johnson            | 01/03/1979  | rnb    | -2        |
|   1230 | Ana  | Patson             | 12/01/1980  | rnb    | +1        |
|   1230 | Ana  | Patson             | 12/01/1980  | hiphop | +3        |
|   1230 | Ana  | Patson             | 12/01/1980  | dnb    |  0        |
+--------+------+--------------------+-------------+--------+-----------+

We have various inconsistencies in this table, in particular:

  • redundancies: we have data that is repeated multiple times
  • deletion anomalies: if we were to delete the fact that user Ana with ID 1230 has no tags associated with herself then we would have to completely remove her from the database since Tag is part of our primary key. This is probably not what we want.
  • update anomanies if Ana gets married and takes the surname of her husband then we have to update all the rows corresponding to her entry multiple times.
  • insert anomalies we can't insert a new user that has no tag

Normalization is a process by which we remove the inconsistencies named above. The most common are: the the first normal form (abbreviated 1NF), second normal form (2NF), third normal form (3NF), Boyce-Codd normal form (BCNF). There is also a fourth normal form (and a fifth and a sixth) which I'm not getting into here.

A table is in first normal form if all the attributes (i.e. columns) are atomic. What this means is that each attribute must represent a single fact and not an aggregation of facts. Our table is in the 1NF.

Second normal form deals with composite primary keys (i.e. a primary key with more than one attribute, what we have) and states that a relation is in 2NF if each other attribute (i.e. attributes that are not part of the composite primary key) must depend on the entire primary key. Let's look at our table. At first glance we might think that this relation conforms with 2NF because given the composite primary key {UserID, Tag} we can determine all other attributes in the table. However, Name, Surname and DateOfBirth depend only on the UserID and not on Tag which is part of our composite key as well. Therefor this relation does not conform with 2NF.

Third normal form states that a relation is in 3NF if every attribute in the table depends directly on the key. This eliminates the fact that an attribute can depend on another attribute which itself depends on the key (i.e. transitivity rule which is allowed by the 2NF). We already said our table does not conform with 2NF and therefor can't conform with 3NF.

The normal forms are cumulative, which means that if our table is in 3NF then it's also in 2NF and 1NF.

These dependencies we've been talking about are called functional dependencies (FD). A functional dependency is written like X -> Y and you read it as X functionally determines Y (or Y is determined by X) and this simply means that the value of X determines the value of Y (where we have the same value for X we must have the same value for Y). In our case whenever we have UserID 1230 we have the Name Ana and Surname Patson and DateOfBirth 12/01/1980. We therefor have the following functional dependency:

UserID -> Name, Surname, DateOfBirth.

More formally, 3NF states that for each FD X -> Y, X is either the key (or contains a key i.e. a superset of the key) OR Y is part of the key. I mentioned this because it's an important fact about 3NF which differentiates it from BCNF. BCNF states that for each FD X -> Y X is a key.

Other functional dependencies in our relation:

UserID, Tag -> TagWeight

UserID, Tag -> All the attributes

We can see that the first FD User -> Name, Surname, DateOfBirth is not a key (UserID is not the key, only part of the key). This is a problem and this is what's causing anomalies in our relation. When we normalize our relation we decompose based on the functional dependencies. From the first FD we get (R stands for relation i.e. table):

R1(UserID, Name, Surname, DateOfBirth) 

From the second one we get:

R2(UserID, Tag, TagWeight)

Now it's much better. We represent facts about each user in a different table and facts about users and tag associations in another table. This conforms with 3NF and BCNF.

You might want to add the third table which is the R3(TagID, Tag) and replace the Tag attribute in R2 with TagID (and you get 3 tables like I said in the beginning).

I haven't touched on lossless-join decomposition which is important and I encourage you to read about. I hope this helps.

Upvotes: 2

Related Questions