Reputation: 22734
I need to design a database schema for the following problem. Consider this simplified grammatical 'analysis' of some example phrase:
In this model a sentence consists of an array of words, and a word is made up of an array of word parts/morphemes. Relational databases are – as I am learning, notoriously – not very happy about arrays of arrays.
I see two solutions and am unsure how to make the right decision. The first, 'dirty' solution: a single intermediary table which links sentences with morphemes, and stores the array indices. Lots of identical entries in columns.
CREATE TABLE word ( -- pseudo-SQL
sentence_id FOREIGN KEY,
sentence_order INTEGER,
morpheme_id FOREIGN KEY,
morpheme_order INTEGER );
The second, 'clean' solution: Three (!) intermediary tables, probably slow and uncomfortable to use? Note how the word table only serves up IDs for the two foreign key tables to use.
CREATE TABLE sentence_word (
sentence_id FOREIGN KEY,
word_id FOREIGN KEY,
order INTEGER );
CREATE TABLE word ( id );
CREATE TABLE morpheme_word (
morpheme_id INTEGER FOREIGN KEY,
word_id INTEGER FOREIGN KEY
order INTEGER );
I would normally prefer a clean solution but here the clean solution has a kludgy feel to it. I'm trying to do this with a web framework ORM, by the way (Django).
Upvotes: 0
Views: 2664
Reputation: 22734
If we take the data structure involved as an array of array values, there is a simple alternative solution that is clean, efficient, and intuitive to work with:
CREATE TABLE Sentence (...);
CREATE TABLE Word ( sentence_id FOREIGN KEY,
order INTEGER );
CREATE TABLE Morpheme ( word_id FOREIGN KEY,
order INTEGER );
This is simply a 1-to-N relationship, twice. (With Django's ORM you could then simply say word.sentence
to access the sentence a Word instance belongs to, or sentence.word_set.order_by('order')
to get the ordered set of words in some sentence.)
The drawback of this design is that array items which occur many times, such as -ly in extraordinari-ly, are stored many times in the database, once for each occurrence. It is not possible to associate additional data with all -ly morphemes.
Upvotes: 0
Reputation: 1144
Your second solution is the technically correct one. The kludge you're feeling is in fact not due to the array of arrays problem but to the fact that you have a many to many relationship between sentence and word, and between morpheme and word. (Any given sentence could be one or many words, and any word could be part of one or many sentences.) This is a normal kludge that is an (unfortunate?) side effect of SQL.
Since you mention django, django attempts to abstract some of this for you with many-to-many fields.
I think as a basic model for your django you're looking at something like this:
class Sentence(models.Model):
words = models.ManyToManyField(Words, through=SentenceWord)
class Word(models.Model):
morphenes = models.ManyToManyField(Morphene, through=MorpheneWord)
class Mophene(models.Model):
pass
#--- Intermediate Tables ------------
class SentenceWord(models.Model):
sentence = models.ForeignKey(Sentence)
word = models.ForeignKey(Word)
position = models.IntegerField()
class MorpheneWord(models.Model):
word = models.ForeignKey(Word)
morphene = models.ForeignKey(Morphene)
position = models.IntegerField()
Admittedly, I just typed up those models here, but that should get you close to where to need to be.
EDIT: Introduced Word Model.
Upvotes: 2
Reputation: 52157
You'll have a hard time enforcing proper morpheme order in the first design, and for that reason I like the second design better. However, if performance is the issue, the first design might allow you to do less JOINing.
If you happen to use Oracle, you might be able to have your cake an eat it too, by marrying the second design for "cleanliness" with materialized views for performance.
Upvotes: 1