Reputation: 41
I have read many strong statements here and elsewhere on the subject of storing arrays in mysql. The rules of normalization seem to suggest its a bad idea and searching within the stored array fosters inelegant code. HOWEVER, for the application I am working on it seems like a reasonable solution to store an array in a field. I'm sure that is what everyone wrongly thinks in this position but I can't figure out a better way. Here is the setup:
I have a series of tables that store registered students, courses they can take and their performance on each course. All are "normalized" to avoid duplication and errors. I want to be able to generate a "myCourses" section so after login the student sees courses they are eligible for and courses they have taken but are free to review. The approach that comes to mind is two arrays; my_eligible_courses and my_completed_courses. On registration, the student is given a set of courses for which they are eligible. This could be stored as rows where there are multiple occurrences of studentid, one for each course they can take:
student1 course 1 student1 course 2 student1 course n
The table could then be queried for all of student 1's eligible courses and displayed as a list when the student logs in.
Alternately, studentid could be a primary key and in a column "eligible_courses" there would be an array (course 1,course 2, course n).
There is a table for student performance, to record every course taken and metrics associated with student performance. It will be queried to report on student performance, quality of course etc but this table will grow quite large. I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses.
One other complication is that the set of courses a student is eligible is variable and expanding as new courses are developed, which to me seems to suggest that generating a set of new columns for each new course is a bad idea-for example, new course_name, pretest_score, posttest_score, time_to_complete, ... Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
So to restate the question, is it better to store "inelegant" arrayed list of eligible and completed courses in a registered student table or dynamically generate these lists?
I'm guessing this is still too vague but any discussion of db design that gives an example of an inelegant array vs a restructured schema would be appreciated.
Upvotes: 1
Views: 2823
Reputation: 52107
This is a bad idea for two obvious reasons:
X What's to stop a buggy application from storing a non-existent ID in array? Or deleting a course that is still referenced by students? Even if your application is careful about course deletion, there is no way to do it efficiently - you'll need a full table scan to examine all arrays.
Why are you even trying this? A link (aka. junction) table would solve these problems, for a moderate cost of some additional storage space.
If you are really concerned about storage space, you could even switch the DBMS and use one that supports leading-edge index compression (such as Oracle).
I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses.
Databases are very good at querying humongous amounts of data. In this case, if you use the clustering properly, the DBMS will be able to get this data in very few I/O operations, meaning very fast. Did you perform any actual benchmarks? Have you measured any actual performance problem?
Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
Generating a new table may be justified in case it will have different columns. But, that doesn't sound like what you are trying to do.
It seems to me that you simply need:
CHECK (
(COMPLETED = 0 AND (performance fields) IS NULL)
OR (COMPLETED = 1 AND (performance fields) IS NOT NULL)
)
(BTW, you could even omit COMPLETED altogether and just rely on testing the performance fields for NULL.)
InnoDB tables are clustered, which means that rows in STUDENT_COURSE belonging to the same student are stored physically close together, which means that getting courses of the given student is extremely fast.
If you need to go in the opposite direction (get students of a given course), add an index on same fields but in opposite order: {COURSE_ID, STUDENT_ID}. You might even consider covering in this case.
Since we are talking about small number of rows, leaving COMPLETED unindexed is just fine. If you are really concerned about that, you can even do something like:
The COMPLETED_STUDENT_COURSE is a B-Tree only for completed courses (and essentially a subset of STUDENT_COURSE which is a B-Tree for all enrolled courses).
Upvotes: 1
Reputation: 5582
Here are a few thoughts that I believe may assist you in making a good decision.
Generally, it is a rule to use correctly normalised tables. But there can be exceptions to this. Perhaps your project may be such.
Most of the time, new developers tend to focus on getting the data into a DB. They get stuck when it comes to retrieving it for a specific purpose. So given both cases of arrays vs. relational tables, ask your self if either method serves your purpose. For example, if you wanted to list the courses of student X, your array method is just fine. This is because you can retrieve it by the primary key like a student ID. But if you wanted to know how many students are on course A, the array method will be a horrible way to go.
Then again, the above point would depend on your data volume as well. For example, if you only have about a hundred students, you'll probably not notice a difference in performance. But if you're looking at several thousand records and you have a big list of courses for students, the array approach is not the way to go.
Benchmark. This is the best way for you to find out your answer. You can use MySQL's explain or just time it using your program that executes the queries. Try each method with your standard volume of data and see which one works best. For example, in the recent past, MySQL was boasting about their strength of the ISAM engine. Then I had to work on a large application that involved millions of records. And here, I noticed that each time a new record came in, Indexes had to be rebuilt. So now we had to bend the rules. Likewise, you'd better do your tests with the correct volumes of data and make a better decision.
But do not take this example as a rule. Rather, go by the standards of normalisation and only bend the rules for exceptions.
Upvotes: 0
Reputation: 5919
You should feel confident that if you have indexes on your tables for the appropriate columns, querying for my_completed_courses
will be pretty snappy.
When your table grows to the point that you notice slowdown, you can configure your MySQL server with appropriate memory allocation settings so that it can keep more data cached in memory. Or you could look into that now.
In response to the edit you made about adding new courses: Don't add a new column for each course. Don't add a new table for each course. Create a table for courses, and add rows for each course.
You should then be able to join your tables together on indexed columns to generate the list of data you need.
Upvotes: 1