SeanD

Reputation: 147

Table columns question - duplicate names with IDs or no?

I am not too familiar with queries but here is the question: My 'neighbourhood' table has columns:

n_id, name, country_id, continent_id, city_id. 

Here n_id is the PK, and country_id, continent_id and city_id are FKs to their own tables.

Sample data is:

34, Brooke, 23, 3, 1456

This layout is good for data relationships but not for user output. On the user side, when a user sees Brooke on the website, it should read: Brooke, New York - USA. (So in essence: Brooke, 1456 - 23.)

The question is: if I store only IDs in the neighbourhood table, then I have to join two tables every time to pull the names behind the IDs. So, to avoid this, would it be better to store the names again as duplicates, so that the columns become:

n_id, name, country_id, country_name, continent_id, city_id, city_name
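For context, with only the IDs stored, pulling the display names means a two-join query like this every time (assuming my city and country tables each have a name column and matching PKs):

    SELECT n.name, ci.name AS city_name, co.name AS country_name
      FROM neighbourhood n
      JOIN city          ci ON ci.city_id    = n.city_id
      JOIN country       co ON co.country_id = n.country_id
     WHERE n.n_id = 34;
    -- expected output: Brooke, New York, USA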

What is the performance difference between the two approaches? And what are their advantages and disadvantages?

** The site is a social network, if that helps.

Upvotes: 0

Views: 203

Answers (5)

Jonathan Leffler

Reputation: 753970

The major disadvantage of your proposed denormalized design is that the correct referential integrity constraints and update actions become excessively complicated. If the data associated with City_ID 1456 changes, you not only have to change the one row in the City table, you also have to change the stored value in every Neighbourhood row that cross-references the City table (contains City_ID = 1456). These 'update anomalies' are the primary reason for the discipline of normalization.
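To illustrate with a sketch (the Name column on City and the City_Name copy on Neighbourhood are assumed from the proposal above):

    -- normalized design: one row changes
    UPDATE City
       SET Name = 'New Amsterdam'
     WHERE City_ID = 1456;

    -- denormalized design: every copy must change too, in the same
    -- transaction, or the two tables silently disagree
    UPDATE Neighbourhood
       SET City_Name = 'New Amsterdam'
     WHERE City_ID = 1456;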

Performance is tricky to measure; it depends on the DBMS and the size of the tables, but it is quite possible that joining smaller tables will be quicker than scanning all of one big table (where the 'big table' is your proposed revised Neighbourhood table, swollen by all the extra data you want to add). Unless you index both the City_ID and City_Name columns in the Neighbourhood table (thus using more space in indexes than the normalized design would), a search for all the people in a single city may take longer, because the whole of the Neighbourhood table has to be read sequentially. Compare that with an indexed lookup on the City table to find the City_ID, followed by an index scan for precisely that City_ID in the Neighbourhood table.
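In query terms, the comparison is roughly this (a sketch; the City table's name column is assumed to be called Name, and Neighbourhood.City_ID is assumed to be indexed):

    -- denormalized: without an extra index on City_Name, this reads
    -- the whole of the (now much wider) Neighbourhood table
    SELECT * FROM Neighbourhood WHERE City_Name = 'New York';

    -- normalized: an indexed lookup on City to find the City_ID,
    -- then an index scan on Neighbourhood for precisely that value
    SELECT n.*
      FROM Neighbourhood n
      JOIN City          c ON c.City_ID = n.City_ID
     WHERE c.Name = 'New York';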

So, it is possible that your denormalized, 'efficient' design ends up being less efficient by each of several different measures than your current normalized design.

If you do decide to change from your current normalized design, make sure you measure the true costs - in complexity of update operations and disk space used as well as in query speed. It might still be better to denormalize; it is not necessarily a slam-dunk. And your cost equation may change over time, as the size of the data changes, so you may need to do your measurements multiple times.

Upvotes: 0

Paul Keister

Reputation: 13077

When you add duplicate names to the neighborhood table, you are de-normalizing it. De-normalization can make queries faster, especially when the load on your system is very high. But it comes at a cost: you must write and maintain additional code to keep the redundant data in sync.

I would keep 2 things in mind:

  1. As a general rule, never optimize something until you have demonstrated a need to optimize it (Abrash's rule #1).
  2. If you find that your joins need to be faster, the first optimization to try is tuning your indexes (see the sketch after this list). This will let you have fast joins without losing the benefits of a normalized design.
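For instance, the first indexes to check are the ones on the join keys (index names here are illustrative, and whether FK columns are indexed automatically varies by DBMS):

    -- not every DBMS indexes foreign-key columns automatically;
    -- adding these is usually the cheapest join optimization
    CREATE INDEX idx_neighbourhood_city    ON neighbourhood (city_id);
    CREATE INDEX idx_neighbourhood_country ON neighbourhood (country_id);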

Upvotes: 1

Joe Hanink

Reputation: 4847

Don't repeat data in your transactional database.

Normalize properly, and if you are concerned about join performance, you can tune your queries accordingly: add indexes where appropriate, order the join conditions, and so on. There are tools that help you understand the query plan the database chooses. Also note that modern databases do an excellent job of optimizing queries, for example by joining the tables that filter out the most data first, so that the remaining join conditions are cheaper.
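For example, in MySQL or PostgreSQL you can prefix the query from the question with EXPLAIN to see whether the joins actually use the indexes (exact syntax and output vary by vendor):

    EXPLAIN
    SELECT n.name, ci.name, co.name
      FROM neighbourhood n
      JOIN city    ci ON ci.city_id    = n.city_id
      JOIN country co ON co.country_id = n.country_id
     WHERE n.n_id = 34;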

Data warehouses, on the other hand, often copy data to optimize reporting on unchanging historical information.

Upvotes: 0

Karl

Reputation: 3372

As a general rule it is best to normalise your data correctly and then de-normalise to resolve a specific performance problem. Are you having performance problems? Can you set up the ID-only option and test it?

Your first table design has all the normal advantages of normalised data (Google 'insertion, deletion and update anomaly'). If you store the name (with or without the ID) in the neighbourhood table, you have to have a process that ensures it is always the same (e.g. selected from a prepopulated dropdown rather than keyed in) and a method to update it if a name changes.
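For what it's worth, that update method usually ends up being a trigger; a minimal sketch in MySQL syntax, assuming the city table has columns city_id and name and the duplicated column is city_name:

    CREATE TRIGGER city_name_sync
    AFTER UPDATE ON city
    FOR EACH ROW
      UPDATE neighbourhood
         SET city_name = NEW.name
       WHERE city_id = NEW.city_id;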

If you have a real performance problem, this may be worth the extra effort. Otherwise, stay with your normalised solution.

Upvotes: 0

Sarfraz

Reputation: 382706

The most important rule of database design is that data should not repeat. Your design as it is looks fine. Don't be bothered by the fact that you have to write two joins to fetch the data you need to display; some queries run to hundreds of lines :)

As for performance, data coming from a single table naturally arrives faster than data assembled from two tables.

Upvotes: 0
