Blind Fury
Blind Fury

Reputation: 469

The most efficient way to check a matrix offset

This is a database design and query design question, but is a little difficult to formulate.

I am investigating the relationships between vowel sounds in language. Each vowel can be described by its position in a two-dimensional vowel space. One dimension is "height", usually represented by 5 positions between "open" and "close". The second dimension is "backness", usually represented by 3 positions between "back and front. Thus, a vowel can have 3 x 5 = 15 positions. It can also be rounded or unrounded, thus giving 30 permutations (in an extremely simple model).

I want to see the effects of shifting a parameter in one of the dimensions only by a single step, i.e. either the height is shifted by one step or the backness is shifted by one step or the rounded/unrounded property is toggled. In each case, the other two parameters are untouched.

I have a couple of hundred thousand words in a database with the vowel positions theoretically known (i.e. I can map them any way I want, the most obvious being an integer value for height and backness and a boolean for rounded/unrounded).

Now I want to look for words in the database whose (last) vowel satisfies the criteria above in relation to a vowel I select.

If we say that the vowel I am using is at 2,2 on the matrix and is rounded, The sql would look something like this:

SELECT word FROM mytable WHERE (rounded = false AND height = 2 AND backness = 3)
OR (rounded = true AND height = 2 AND (bacKness = 1 OR backness = 3))
OR (rounded = true AND backness = 2 AND (height = 1 OR height = 3))

Clearly, this would do the job, and there is no problem with doing it programmatically (particularly as values outside the matrix size would simply not match, which is fine).

what I am wondering is whether such queries become unrealistic and do not perform well if I were to use a more realistic 6 x 10 matrix and allow multiple degrees of freedom.

Does anyone have any idea as to the best way of representing and querying matrix data like this with SQL?

I hope I have been halfway clear!

Upvotes: 0

Views: 55

Answers (2)

Jonathan Leffler
Jonathan Leffler

Reputation: 754570

As long as you have appropriate indexes available, you should be OK. What's an appropriate index, you ask? Good question! You need at least one index on all three of backness, height and rounded. For the query shown, it will depend on how good the optimizer is.

SELECT word
  FROM mytable
 WHERE (rounded = false AND height = 2 AND backness = 3)
    OR (rounded = true  AND height = 2 AND (backness = 1 OR backness = 3))
    OR (rounded = true  AND backness = 2 AND (height = 1 OR height = 3))

If the optimizer rewrote the query as:

SELECT word
  FROM mytable
 WHERE (rounded = false AND height = 2 AND backness = 3)
    OR (rounded = true  AND height = 2 AND backness = 1)
    OR (rounded = true  AND height = 2 AND backness = 3)
    OR (rounded = true  AND backness = 2 AND height = 1)
    OR (rounded = true  AND backness = 2 AND height = 3)

which can then be revised to:

SELECT word
  FROM mytable
 WHERE (rounded = false AND height = 2 AND backness = 3)
    OR (rounded = true  AND height = 2 AND backness = 1)
    OR (rounded = true  AND height = 2 AND backness = 3)
    OR (rounded = true  AND height = 1 AND backness = 2)
    OR (rounded = true  AND height = 3 AND backness = 2)

then it may be able to make effective use of an index with the columns in any order. On the other hand, if the optimizer is not capable of such rewrites, then it would likely fall back on a table scan. At this point, you have to investigate how good the SQLite optimizer is. With a couple hundred thousand records, on a half-way decent machine (say, one that is from this millennium, or preferably this decade), there shouldn't be too much problem — almost all the database should fit in memory (if SQLite uses memory caching for data). It depends a bit on what other data is associated with each word.

If you ever ended up with a query on only two of the three attributes, then you need the index to have the first two of its column being the ones you reference (so if you refer to backness and rounded, you need the index to be on (backness, rounded, height) or (rounded, backness, height) so the index can still be used. Taken to extremes, this could mean 7 indexes (3 on single columns, 3 on each possible pair of columns, 1 on all three columns). It would 'work'; it might use more disk space storing the index than is warranted by the performance gain from storing them and the performance loss from having to update 7 indexes every time you insert a row or modify the vowel sound attributes of a row.

The OR formulation can cause trouble for DBMS optimizers. The nested OR formulation might cause more trouble. As originally written, it is not immediately obvious that you can use a single index. But it all hinges on the DBMS and the optimizer.

Check out how a UNION (or UNION ALL) query with the alternatives performs, too.

SELECT word
  FROM mytable
 WHERE (rounded = false AND height = 2 AND backness = 3)
UNION
SELECT word
  FROM mytable
 WHERE (rounded = true  AND height = 2 AND backness = 1)
UNION
SELECT word
  FROM mytable
 WHERE (rounded = true  AND height = 2 AND backness = 3)
UNION
SELECT word
  FROM mytable
 WHERE (rounded = true  AND height = 1 AND backness = 2)
UNION
SELECT word
  FROM mytable
 WHERE (rounded = true  AND height = 3 AND backness = 2)

UNION ALL is safe, possibly even sensible, since the lists of words from each term in the UNION gives a disjoint set of words (no duplicates — unless the same word is given with different vowel properties).

You could also try the UNION with the OR'd terms from the original variant of the query.

Upvotes: 1

Googie
Googie

Reputation: 6057

SELECT matched.word
  FROM mytable orig, 
       mytable matched
 WHERE (
           (matched.rounded <> orig.rounded)
           +
           (abs(matched.height - orig.height) > 1) /* validate "height" */
           +
           (abs(matched.backness  - orig.backness) > 1) /* validate "backness" */
           /* + (...) add more column conditions here if you need */
       ) = 1    /* this is how many columns can be different */

Using this query you can just define what columns you want to be included in comparision and how many differences you want to allow.

Note that in results you will get list of words that have exactly 1 column different (not less, not more). If you want to include in results also rows that had no differences, change "= 1" to "<= 1" at the end.

The abs(matched.height - orig.height) > 1 condition means that the difference in column height is greater than 1 and for this column the condition will increment total number of differences by 1.

You can also change every column condition (from "> 1" to any greater value), so it allows greater ranges of single column value differences.

Upvotes: 1

Related Questions