Malt
Malt

Reputation: 30295

Storing and referencing an immutable ordered list in a relational database

Background: I have a database containing parents and children's names (this is a simplification of the actual data, but the analogy is close enough).

Task: The database has to store an ordered list of children's names for each parent.

Assumptions:

  1. The database will contain millions of parents, possibly more.
  2. Parents typically have no more than to 4 or 5 children, but rare (and even extreme) cases have to be supported as well.
  3. Children's names (as well as the ordering) tends to repeat a lot. So parents should reference some children_names_list_id instead of keeping a copy of the actual names.
  4. Children's names as well as their ordering for a particular parent are immutable.
  5. Insertion of new parents will be very frequent. When a new parent is inserted with a list of his children, if such a list of names already exists in the database, the new parent should reference the existing list identifier.
  6. Queries about names and their ordering should be possible (for example - find all parents that names a child "Bob" after naming a child "Alice", or find all parents that named a child "Alice", then had two more children, with the third named "Carol" etc)

Questions:

  1. What's the best way to store such lists? The solution should be robust and support fast parent insertions.
  2. How should a parent reference the lists?

Current (proposed) solution:

My current approach is to have a table that maps children names to integer name ids (names are long, integers are short). Then store name lists in the following tuples: <list_id> <order> <name_id> so the list table will look like this:

<list_id> <order> <name_id>
    1       1       123
    1       2       345
    1       3       678
    2       1       901
    3       1       123
    3       1       901

The example table contains three lists: [123,345,678], [901], [123,901] which might correspond to something like: ["Alice", "Bob", "Carol"], ["Dave"], ["Alice", "Dave"] The parents table will then have a children_list_id column that references the list_id column.

This solution seems to be robust, except for two issues:

  1. I'm not sure whether insertions will be fast enough (looking up whether an existing list already exists seems like it could be slow), but other approaches seem to be less robust or (much) harder to query.
  2. The key to the name list table is composed of both the list_id and the order columns; the parents table has to reference only the list_id which should be a foreign key, but since list_id isn't a key by itself in the list table, an additional table of lists, in which list_id is key is needed. This seems cumbersome.

Alternate solution:

The table of lists will store implicit ordering in the columns:

<list_id> <name_1> <name_2> <name_3> <name_4> ... <name_100>
    1        111     222     333      null
    2        444     null
    3        555     111     null

In this table, list_id will be the primary key.

The parents table will keep the list_id as a foreign key.

This solution is somewhat less robust (how many columns do I create? 10? 20? 50?), but makes insertions much quicker. And since the list_id is a key, no additional tables are needed. A possible downside however is that some queries become much more complicated since they have to reference multiple columns.

Thanks!

Upvotes: 1

Views: 689

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269953

The list table is over-design. Just have a Parents table, a Names table and a ParentChildren table. The ParentChildren table is just like your list table, except for a few details. It would look like:

<ParentId> <Order> <NameId>
    1         1     123
    1         2     345
    1         3     678
    2         1     901
    3         1     123
    3         1     901

I don't see a particular savings to storing independent lists. Just store the children for each parent.

Upvotes: 1

Related Questions