Sam
Sam

Reputation: 721

Database Design: Composite key vs one column primary key

The database of our web application includes two tables:

States (idStates, State, Lat, Long)

idStates is an auto-incrementing primary key.

Cities (idAreaCode, idStates, City, Lat, Long)

idAreaCode is a primary key consisting of country code + area code (e.g. 91422 where 91 is the country code for india and 422 is the area code of a city in India). idStates is a foreign key referencing States to associate each city in Cities with its corresponding state.

We figured that country code + area code would be unique for each city, and thus could safely be used as a primary key. Everything was working. But a location in India found a flaw in the DB design--India, like the US, is a federal democracy and is geographically divided into many states or union territories. Both the states and union territories data is stored in States. There is, however, one location--Chandigarh--which belongs to TWO states (Haryana and Punjab) and is also a union territory by itself.

The current db design doesn't allow us to store more than one record for the city Chandigarh.

One solution suggested is to create a primary key combining columns idAreaCode and idStates.

What is a solution?

We are using MySQL with the InnoDB engine.

The database stores meteorological information for each city. Thus, the state and city are the starting point of each query.

Database normalization is important to us.

The database is updated daily / hourly using a CSV file (which is generated by another app). Each record in the CSV file is identified by the idStates and idAreaCode columns.

Hence it is preferred that the primary key used in City, rather than be auto-incremented, is the same for every city, even if the table is deleted and refreshed again. Zip codes (or pin codes) and area codes (or STD codes) meet the criteria of being unique and static (don't change often) and a ready list of these are easily available. (We decided on area codes for now because India is in the process of updating its pin codes to a new format.)

PS: We decided to handle this at the application level instead of making changes to the database design. In the database we will only be storing one record for Chandigarh. In the application we created a flag for any search for Chandigarh Punjab or Chandigarh Haryana to redirect to this record. It's an acceptable compromise since this is the ONLY exception we've come across.

Upvotes: 2

Views: 7842

Answers (8)

Magician
Magician

Reputation: 2123

Add another table, Countries.

Your problem is an example why database normalization is important. You can't just mix and match different keys to one column.

Countries

+------------+--------------+
| country_id | country_name |
+------------+--------------+

States

+------------+----------+------------+
| country_id | state_id | state_name |
+------------+----------+------------+

Cities

+------------+----------+---------+-----------+
| country_id | state_id | city_id | city_name |
+------------+----------+---------+-----------+

Data

+------------+----------+---------+---------+----------+
| country_id | state_id | city_id | data_id | your_CSV |
+------------+----------+---------+---------+----------+

The bold fields are primary keys. Enter a standard country_id like 1 for US, 91 for India, and so on. city_id should also use their standard id.

You can then find things belonging to each other fast with minimal overhead. All data can be entered directly into Data, which is thus serving as one entry point, storing all the data in a single spot. If your database supports partitioning, you can partition tables according to country_id or country_id+state_id to a couple of server arrays; thus it will also speed up your database performance considerably. Contries, States and Cities won't make much of a hit on server load and will only serve as reference. You will mainly be working on Data. You can add data as much as you wish, without any duplicates.

If you only have one datum per city, you can omit Data and move CSV_data to Cities:

Cities

+------------+----------+---------+-----------+----------+
| country_id | state_id | city_id | city_name | CSV_data |
+------------+----------+---------+-----------+----------+

Upvotes: 2

mP.
mP.

Reputation: 18266

Introduce a surrogate key.

What are you going to do when area codes change numbers or get split? Using a business key as a primary key almost always is a mistake.

Your summary is another example of why.

Upvotes: 1

Adam Robinson
Adam Robinson

Reputation: 185703

If you add a column to the key so that you can add an additional record for a given city then you're not properly normalizing your data. Given that a city can be a member of multiple states, remove any reference to a state from Cities and add a StateCity table that relates states and cities (creating a m:m relationship).

Upvotes: 1

Roee Adler
Roee Adler

Reputation: 34010

Add a new primary key field to the Cities table that will be simply auto-incremental. The KISS methodology (keep it simple).

Any other solution is cumbersome and confusing.

Upvotes: 0

PerformanceDBA
PerformanceDBA

Reputation: 33818

  1. The database is not Normalised. It may be partly Normalised. You will find many more bugs and limitations in extensibility, as a result.

  2. A hierarchy of Country then State then City is fine. You do not need a many-to-many additional table as some suggest. The said city (and many in America) is multiply in three States.

  3. By placing CountryCode and AreaCode, concatenated, in a single column, you have broken basic database rules, not to mention added code on every access. Additionally, CountryCode is not Normalised.

  4. The problem is that CountryCode+AreaCode is a poor choice for a key for a City. In real terms, it has very little to do with a city, it applies to huge swaths of land. If the meaning of City was changed to town (as in, your company starts collecting data for large towns), the db would break completely.

  5. Magician has the only answer that is close to being correct, that would save you from your current limitations due to lack of Normalisation. It is not accurate to say that Magician's answer is Normalised; it is correct choice of Identifiers, which form a hierarchy in this case. But I would remove the "id" columns because they are unnecessary, 100% redundant columns, 100% redundant indices. The char() columns are fine as they are, and fine for the PK (compound keys). Remember you need an Index on the char() column anyway, to ensure it is unique.

    • If you had this, the Relational structure, with Relational Identifiers, your problem would not exist.
    • and your poor users do not have to figure silly things out or keep track of meaningless identifiers. They just state, naturally: State.Name, City.Name, ReadingType, Data ... .
  6. When you get to the lower end of the hierarchy (City), the compound PK has become onerous (3 x CHAR(20) ), and I wouldn't want to carry it into the Data table (esp if there are daily CSV imports and many readings or rows per city). Therefore for City only, I would add a surrogate key, as the PK.

  7. But for the posted DDL, even as it is, without Normalising the db and using Relational Identifiers, yes, the PK of City is incorrect. It should be (idStates, idAreaCode), not the other way around. That will fix your problem.

Very bad naming by the way.

Upvotes: 0

"We figured that the country code + area code combination would be unique for each city, and thus could safely be used as a primary key"

After having read this, I just stopped to read anything further in this topic. How could someone figure it in this way?
Area codes, by definition (the first one I found on internet):
- "An Area code is the prefix numbers that are used to identify a geographical region based on the North American number Plan. This 3 digit number can be assigned to any number in North America, including Canada, The United States, Mexico, Latin America and the Caribbean" [1]

Putting aside that they are changeable and defined only in North America, the area codes are not 3-digits in some other countries (3-digits is simply not enough having hundred thousands of locations in some countries. BTW, my mother's area code has 5 digits) and they are not strictly linked to fixed geographical locations.

Area codes have migrating locations like arctic camps drifting with ice, normadic tribes, migrating military units or, even, big oceanic ships, etc.

Then, what about merging a few cities into one (or vice versa)?

[1]
http://www.successfuloffice.com/articles/answering-service-glossary-area-code.htm

Upvotes: 1

Walter Mitty
Walter Mitty

Reputation: 18950

It sounds like you are gathering data for a telephone directory. Are you? Why are states important to you? The answer to this question will probably determine which database design will work best for you.

You may think that it's obvious what a city is. It's not. It depends on what you are going to do with the data. In the US, there is this unit called MSA (Metropolitan Statistical Area). The Kansas City MSA spans both Kansas City, Kansas and Kansas City, Missouri. Whether the MSA unit makes sense or not depends on the intended use of the data. If you used area codes in US to determine cities, you'd end up with a very different grouping than MSAs. Again, it depends on what you are going to do with the data.

In general whenever hierarchical patterns of political subdivisions break down, the most general solution is to consider the relationship many-to-many. You solve this problem the same way you solve other many-to-many problems. By creating a new table, with two foreign keys. In this case the foreign keys are IdAreacode and IdStates.

Now you can have one arecode in many states and one state spanning many area codes. It seems a shame to accpet this extra overhead to cover just one exception. Do you know whether the exception you have uncovered is just the tip of the iceberg, and there are many such exceptions?

Upvotes: 4

Aviad Ben Dov
Aviad Ben Dov

Reputation: 6409

Having a composite key could be problematic when you want to reference that table, since the referring table would have to have all columns the primary key has.

If that's the case, you might want to have a sequence primary key, and have the idAreaCode and idStates defined in a UNIQUE NOT NULL group.

Upvotes: 2

Related Questions