Reputation: 273
Advice on SQL tables, circular reference and foreign keys.
I'm very new to SQL (about a month or so), so please forgive any subsequent unfortunate naivete. I working on a project on stories where a user can start a story and another user can add to that story. At the moment, my two main tables are stories and paragraphs. Stories are made up of paragraphs. Paragraphs are just a chunk of text. The stories schema looks like this:
stid varchar not null primary key,
title text not null,
description text,
created_at timestamptz DEFAULT now()
The paragraph schema looks like this:
prid bigint not null primary key,
story varchar not null REFERENCES stories(stid),
maintext text,
writer text not null REFERENCES users(username),
parentpr bigint, //the previous paragraph
childpr bigint, //the next paragraph
created_at timestamptz DEFAULT now()
I am thinking of adding a headpara and lastpara column to the stories schema (with an ALTER) so I can easily access the first paragraph and last paragraph but that creates circular reference situation, since stories will ref paragraph and vice versa. Is this okay? Will it become a more when I begin to deal with larger amount of data and queries?
I have thought of a solution where I have another table: story-paragraph assignment. schema:
ID primary key
story REFERENCES stories(stid),
headpara REFERENCES paragraph(prid),
lastpara REFERENCES paragraph(prid)
for some reason, I am not convinced of this solution. It feels redundant to me. This is not a many-to-many situation. But paragraphs needs to reference stories and I need to be able access the first paragraph and last paragraph of a story.
Another possible solution could be having two boolean columns in the paragraph schema called head and tail so first paragraph can be called with
WHERE story == stID AND head == True.
Thoughts? This solution seems like it would be an issue when my paragraph table is very large. Many thanks in advance.
Upvotes: 4
Views: 1297
Reputation: 29934
I'd actually be disinclined to have a table of separate paragraphs in the first place.
When a writer edits their work, paragraphs are not some kind of hard dividing unit for them. When I revise my writing, moving sentences between paragraphs, rearranging paragraphs, merging paragraphs, separating paragraphs, and even deleting whole paragraphs are things that happen frequently. These kinds of updates will be very difficult for you to implement with the structure you've set up. This makes the division you've chosen questionable, and the problem you're facing is just another facet of how this structure is rather unnatural.
If you need to support editing the stories, then I might be inclined to look at non-relational databases (e.g., Couch or Mongo).
If I were stuck with PostgreSQL, I'd probably try out a single column containing the entire story first. The normal text types in PostgreSQL handles up to about 1 GB of text. This is probably big enough. Assuming each character is two bytes (an over-estimation for English with UTF-8) and each word is 10 characters and 1 space (again, an over-estimation), the column can hold stories of over 48 million words. If the paragraphs contain formatting marks, that number goes down, of course.
But this runs into other problems: moving that amount of text back and forth might be slow and maintaining the indexes on updates (probably a full text one) becomes expensive. The index problem might be worked around with a technology like Lucene or Solr; the problem of shuffling a lot of text back and forth is harder. If the stories you have to deal with are comparatively small, the normal full text mechanisms might be enough for you, though.
But the bottom line is that breaking stories up by paragraph makes building the software harder if stories can be edited, and you should rethink the architecture.
However, if editing is not a feature you need to support, you might be able to get away with breaking up the story by paragraphs strictly as an optimization. In this case, you would be inserting all of a story's paragraphs in bulk, allowing you to break them up into separate rows at import time. "Editing" would consist of deleting all the paragraphs and inserting a new set of them.
In this case, the "linked list" structure stops making a lot of sense. Linked lists optimize edits to a list (insertion and deletion are O(1)), but if breaking up a story by paragraphs is viable at all (as I described above), then edits within the list are an operation you no longer need to optimize. Instead, you'd be optimizing reads. This is likely to require random access of some kind. For example, you might read 5 paragraphs at a time, as the user scrolls through the story, which would require you to be able to start reading at an arbitrary paragraph somewhere in the middle.
This suggests a completely different and more natural way of organizing the table: put a column representing position on the paragraphs table. This column's value can be generated when bulk inserting the paragraphs. This makes fetching by position trivial. For example, to load the next paragraph while the user scrolls, you just track the position of the last paragraph you fetched for them (like paragraph 29) and then load the next five (WHERE position >= 30 and position <= 34
).
With this arrangement, your paragraph table might look like this:
CREATE TABLE paragraph (
paragraph_id SERIAL PRIMARY KEY,
story_id INTEGER NOT NULL REFERENCES stories (story_id),
position INTEGER NOT NULL,
-- Other columns
created_at TIMESTAMPTZ DEFAULT now()
)
This does leave one remaining question, which is actually your original question. How do you fetch the last paragraph with this set up? And that's actually not very hard:
SELECT *
FROM paragraph
WHERE story_id = 30
ORDER BY position DESC
LIMIT 1
The key here is to ORDER BY
the position in reverse order and then use LIMIT
to tell the DB you only want the first row after sorting. This is a very efficient query. It would probably make sense to create a combined index between the story's ID and the position to optimize this query if you run it often:
CREATE INDEX ON paragraphs (story_id, position)
Although, with the linked list structure gone, querying for the last paragraph may not make sense anymore.
Note that either way, the linked list structure goes away. This makes sense. Relational databases are optimized for random access, and the sequential access of a linked list runs against the grain. If you really need linked list style access, there's a good chance that the relational database isn't a good fit for your data. Graph DBs are a natural fit for linked lists style access: they work in terms of nodes and edges between them. (Note that this is not particularly common.)
Upvotes: 2
Reputation: 1270301
You can do the solution either way. If you know that the head paragraph and last paragraph are really important, then having references to them in the story is fine.
In either case, there is a bit of a challenge maintaining relational integrity. Presumably, you want the head and last paragraphs to be in the same story. For this, you will want a composite key. And you need to add the key using a separate alter table
statement. So:
alter table paragraph add constraint unq_paragraph_story_prid unique (story, prid);
alter table stories add constraint fk_stories_headpara
foreign key (stid, headpara) references paragraph(story, prid);
alter table stories add constraint fk_stories_lastpara
foreign key (stid, lastpara) references paragraph(story, prid);
Similarly, if you use the flags, you will need to ensure that there is exactly one flag of each type set. That can be a bit of a pain when updating. That constraint would look like:
create unique index unq_paragraph_headpara paragraph(story) where head = 1;
create unique index unq_paragraph_lastpara paragraph(story) where last = 1;
Notes about naming and other things:
id
s should be numeric, if they can be. This simplifies foreign key references.paragraphId
or paragraph_id
) or simply id
. If you use prid
, that could get confused with another table.Upvotes: 1