sqlite, xml, or a combination?

Question

I'm designing a system which allows searching through legislation for related clauses, and I'm having some difficulty in figuring out the best way to store the data. Here are the criteria:

The legislation concerned is in a tree-like structure. Each Act contains Sections, which may have Subsections to any depth (e.g. Act1: 2.1.a.c). Each section or subsection at every level is a single clause. Acts may also contain Regulations, which are basically annexures, and contain a similar set of sections and subsections. Each act and regulation have a date in which they took effect (not necessarily the same). A simple example of the structure is:

Act1: "Act controlling something" (2001)
    Section 1: This section relates to:
        a.     Something
        b.     These things too:
            1.    A long thing
            2.    A short thing
    Regulation 1: (12 Jan 2004)
        Section 5: This section relates to Section 1 of the main Act
            a.   This applies to everything short
            b.   This applies to everything long
    Regulation 2: (14 Feb 2008)
        Section 6: This section relates to Sections 1 and 2 of the main Act
            a.   This applies to all everything in the sections
            b.   This applies to something

Clauses and sections are related to each other based on subjective criteria, and the relations need to be set up manually, but they could apply on any level, e.g. Act1.Regulation1.Section5.a -> Act1.Section1.b.2 or Act1.Regulation2.Section6 -> Act1.Section1. The relationships do not necessarily happen in both directions.
The system needs to be able to query these relationships, so that a search for Act1.Section1 will bring up everything marked as related to it or any of its subsections, possibly also constrained by date.
The system needs to be in a stand-alone environment, so file-based, not server-based.
The data will be read-only to users.

The frontend and search engine are quite straight forward, and however I store the data I'll probably implement it using python.

Since the backend needs to be file-based, I assume sqlite would be the easiest database to use. However, I'm not entirely convinced that xml isn't a better way of doing it. My only concern is that a database might be easier to integrate with other systems later on, if necessary. I could also combine the two, using xml to store all of the legislation, and a sqlite table containing all the links.

In brief, here is my question: What would be the most appropriate storage structure for this type of data?

hroptatyr · Accepted Answer

I'd go for the ``combination'' option as you suggested.

store the actual contents in XML
store the relations as pairs of (absolute) XPaths in the database, e.g. ['/Act1/Regulation1/Section5/a', '/Act1/Section1/b/2']; ['/Act1/Regulation1/Section5/a', '/Act1/Regulation2/Section6'] (that's 2 rows of 2 columns each, first column isn't unique, pairs are unique)
store the inverse relation in a separate table with the same layout (for reverse look-ups): ['/Act1/Section1/b/2', '/Act1/Regulation1/Section5/a']; ['/Act1/Regulation2/Section6', '/Act1/Regulation1/Section5/a'] (same uniqueness constraints as above)

If you want to do partial look-ups a la "Show me if /Act1/Regulation2 is referred to by /Act1/Regulation1" you could either add an indirection to the cross tables (shown below), or if you need extreme performance (which I don't think you do as there's probably not much data (less than a 100 million relations)) you could use a two-fix trie (as in a prefix and suffix trie)

relation_table (id, id):
[set_112, set119]
[set_112, set120]

set_table (id, prefix, is_full_path):
[set_112, '/Act1', false]
[set_112, '/Act1/Regulation1', false]
[set_112, '/Act1/Regulation1/Section5', true]
...

It's the set of all prefixes (and/or suffixes) of the XPaths. Answering the query above would then be:

SELECT set_id FROM set_table WHERE prefix = FOO
SELECT second_colum FROM relation_table WHERE first_column IN (set_112, ...) (i.e. the result of the former query)
SELECT prefix FROM set_table WHERE set_id = RESULT_FROM_PREVIOUS AND is_full_path=true

Forgive me if I didn't reproduce your example relations right in my examples.

Edit: Depending on what your needs are regarding the output (and how big the inital XML file is) I'd even avoid XPath look-ups into the original document and instead store fragments of XML in the database (like all the actual text nodes that belong to each XPath) and then reconstruct a smaller version of the original XML on the fly. XML parsing (and indexing) is slow, printing something as XML is not.

sqlite, xml, or a combination?

Answers (2)

Related Questions