StackOverMySoul
StackOverMySoul

Reputation: 2037

SQL - Survey Data, Table Schema Design for looped survey questions

Suppose we have a survey where some of the questions are asked across multiple entities.

For example:
Car Brands = [Brand 1, Brand 2, Brand 3, Brand 4...]

This questions will be asked for each one of the car brands (looped).
Question Q01 = (Scale 1-10) Do you think [Car Brand] cars are reliable?
Question Q02 = (Scale 1-10) Do you think [Car Brand] cars are a good value?
...

I'm designing a schema that will power some web based analytic tools, so query performance is important.

The schema will be 3 tables: Records, Questions, Answers

I have two approaches for the answers table:

A) Table: Answers

QuestionId | AnswerValue | BrandOption 
   Q01     |      7      |      1
   Q01     |      5      |      2
   Q01     |      4      |      3
   Q01     |      8      |      4

B) Table: Answers

QuestionId | AnswerValue
  Q01-1    |     7
  Q01-2    |     5
  Q01-3    |     4
  Q01-4    |     8

The queries can be either for one brand at a time or for all the brands, with equal priority for both queries.

Option A seems to give me some advantages if I ever need to do something like a group by, however if most of the queries are for a specific brand, then Option B seems to be more efficient.

Thoughts?

Upvotes: 0

Views: 333

Answers (2)

Zohar Peled
Zohar Peled

Reputation: 82474

Option A is better, even if you don't see it right now.
Storing multiple values in a single database "cell" is a mistake any way you look at it (though unfortunately, a very common mistake) - not to mention it's a violation of the first normal form - which specifically states that each column can only contain a single atomic value in each row (though the original rule is using a different terminology).

The disadvantages are numerous and some of them are critical, including (but not limited to):

  • You loose the ability to use the proper data type - two ints stored together must be stored as a different data type than int.
  • You might loose the ability to verify your data is, in fact, correct, or that the different parts can be converted to the correct data type (most databases supports check constraints nowadays but not all (Yes, MySql, I'm pointing my finger at you!))
  • You loose the ability to enforce uniqueness on each parts of the data separately.
  • You can't use the different parts of the data as basis for foreign key constraints

The list goes on and on - but I think anyone should get the picture by now - a database column should be used to store a single value for each row - every time.

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269593

The first version is preferable in my opinion. It makes it easier to look for answers to different questions for a single brand and to the same question across brands.

Munging the question id seems like a poor substitute. For one thing, it precludes simple foreign key relationships to a questions table and to a brands table. I'm a big fan of explicit foreign key relationships.

Of course, to make this work, you will need a method to store "no brand" or "brand no relevant". One method is to use NULL for such answers.

Upvotes: 0

Related Questions