user2895783
user2895783

Reputation:

Database design, handle record-specific and arbitrary amount of order-dependent data

Bear with me as I am still learning. Essentially, in abstract terms, I have a set of data which could easily fit in a 1NF or 2NF, but also have some that varies in number of items, which I want associated with a record, in which the order must be maintained. be aware, I'm not concerned with any specific database or language, just the very basic approach and theory to this problem.

To simplify to most basic elements, I have an ID, a Goal, and Tasks required in order to complete the goal. For this example, I have excluded other fields like Name (string), Section (string), and so on, as they are straightforward to handle.

At first, I figured, maybe there will only ever be 5 Tasks, as a casual glance of the data set seemed to indicate about 2-3 Tasks (strings). The order in my code was implied to be 1 -> 2 -> 3 and so on.

ID (key), Goal (string), Task1, Task2, Task3, Task4, Task5

I immediately did not like that, as half the values end up being NULL, but it sort of worked, and I was learning some other things like how to call the SQL from my scripting language. Then I started seeing Goals that had 6, 7 and 8 Tasks. :( Do I just keep randomly adding more columns as needed, and thus increase the percentage of NULLs stored? No. Not a good idea.

So I wondered, do I just cram all tasks into one field, and specify a delimiter? Then I could just use a split and join or a regexp to format the data. In this example, my tasks consist of 1-3 tokens of [A-Za-z '], so it's easy enough to handle.

ID (key), Goal (string), Tasks (string)

Where Tasks is of the form task1,task2,task3,...

Something about that just seems to bother me. What if I am working on multiple goals at the same time, and want to get a list of all Names that need the same set of tasks applied? For example, say I have:

123, "Name1", "Goal1", "task1,task2,task3,task4,task5"
456, "Name2", "Goal2", "task2,task3,task4"
789, "Name3", "Goal3", "task3,task4,task5"

How messy it becomes now to look up all records that require task3? Maybe I could just use a LIKE to find what I want? Seems like a horrible abuse of the function. Could break it all apart, handle the logic in a script, seems even more messy, inefficient, difficult to maintain. For example, making changes to all task3 entries, or changing the order of tasks, would not be good.

Shooting fish in a barrel and using knives on a cutting board could both arguably be used to make sushi...

So I wondered about putting the Tasks data in a separate table, sharing the same ID key. That's look like this.

Main Table:

123, "Name1", "Goal1"
456, "Name2", "Goal2"
789, "Name3", "Goal3"

Tasks Table:

123, "Task1"
123, "Task2"
123, "Task3"
123, "Task4"
123, "Task5"
456, "Task2"
456, "Task3"
456, "Task4"
789, "Task3"
789, "Task4"
789, "Task5"

At this point my gut feeling is that something has gone horrifically wrong with my thinking. I've lost the ability to ensure that the order is maintained. A query for all tasks needed by any specific ID could result in any order. It's also storing a lot of redundant data. At least I got rid of the NULLs? But that's no good.

At this point, something else is bugging me, which probably should have been addressed earlier in design. But I am trying to teach myself, and learning as I go. So here I go, off on a tangent.

There's a lot of redundant textual data, as these Task descriptions are constant. So I was wondering how to best optimize that, to minimize disk usage, and increase speed, without cluttering code with too much scripting overhead. One idea I had was to create a table of enumerations.

Enumerations: ID (key), Task (string)

1, Task5
2, Task4
3, Task3
4, Task2
5, Task1
6, Task10
7, Task9
8, Task8
9, Task7
10, Task6
and so on.

Well, at least instead of a string being stored everywhere, I could store a much smaller integer. Even if they were in the worst case 64-bit integers, that's 8 bytes, still smaller than the strings I'd be storing. My code would read in the enumerations, store in run time, and use that to reference the strings.

Not sure if that's a valid technique, if there's a better way to approach that problem, or even what it is called. Indexing? Or is that something different? Or is it something some databases can do automagically?

Anyways, back to the main problem, what to do with my arbitrary list of order-dependent tasks? Create 1-off tables per main record, each with it's own ORDER (key) and Task (string/int/enum) entry? Seems even worse for overhead.

It seems to me like this is a basic problem, and has a few standard ways of approaching it. On my limited budget, lack of books, slow connection, and Google endlessly sending me nowhere, I figured I'd ask for any tips. Any free online references to sources of knowledge (specific sites or articles) also welcome.

Upvotes: 2

Views: 323

Answers (2)

Branko Dimitrijevic
Branko Dimitrijevic

Reputation: 52147

Your thinking is sound and you got very close to the real solution yourself, I'll just nudge you a little further to get you there...

enter image description here

Example data:

GOAL
----
123, "Goal1"
456, "Goal2"
789, "Goal3"

TASK
----
1, 'Task1'
2, 'Task2'
3, 'Task3'
4, 'Task4'
5, 'Task5'

GOAL_TASK
---------
123, 1, 1
123, 2, 2
123, 3, 3
123, 4, 4
123, 5, 5
456, 1, 2
456, 2, 3
456, 3, 4
789, 1, 3
789, 2, 4
789, 3, 5

In relational databases, a table is a physical manifestation of a relation, which is a set, and sets are fundamentally unordered. So while the table will have some physical order1, it will be logically unordered, and the only way to guarantee the order of a query result is to use ORDER BY clause (and for that, we need an explicit column that defines the order, such as POSITION above).

The GOAL_TASK's primary key {GOAL_ID, POSITION} ensures no two tasks can occupy the same position for the given goal.

The UNIQUE constraint U1 in GOAL_TASK ensures that the same task cannot be connected to the same goal multiple times. You can easily remove that constraint if you want to allow such repetition.

If you are interested in database modelling in general, you can take a look at ERwin Methods Guide.


1 Which is the implementation detail of the DBMS, but see clustering.

Upvotes: 1

Walter Mitty
Walter Mitty

Reputation: 18940

Your phrase "the order must be maintianed" can mean at least two different things.

It could mean that the order must be maintained at store time, by placing the new item in a location that will keep it in order.

It could also mean that the order must be maintained at retrieval time, by retrieving the itemsin the correct order.

If you mean the second thing above, it's fairly easy. You need one more column, one that will make the correct ordering explicit. For example, children might be ordered by their last name (alphabeticaly) or by their age, or by their weight (numberically). Books might be ordered by their title, or by their library retrieval number.

Then, when you want to retrieve the items, just include the "order by" clause in the SQL query that does the retrieving. No matter what order they are stored in, they will be delivered in the order you specify.

This retrieval process can be made a lot faster by creating an appropriate index. The cost is that adding new items will run a little slower, and the index will occupy some disk space.

There are situations where it takes more than one column to specify the right order.

But this covers the simplest case.

Upvotes: 0

Related Questions