Reputation: 14218
I have a key value table:
id, data_id, key, value
that references a data table:
id, file_id, data
which in turn references a files table:
id, name
files contains about 10000 entries data contains about 100 million entries
The key column has about 100 distinct values The value field has many distinct values (strings)
Now I am wondering which order the three columns data_id, key, value
should have in my index.
The queries that I am interested in look for data entries that have certain file_id and specific key value combinations. E.g.:
SELECT * FROM files, data, keyval
WHERE files.id = data.file_id
AND data.id = keyval.data_id
AND files.id = 999
AND keyval.key = 'revenue'
AND keyval.val = 9.99
Upvotes: 4
Views: 381
Reputation: 142316
A key-value schema sucks. As your queries get more involved, you will understand why.
One thing you can do to help some is to follow the tips here.
Also, please specify the relationships in the tables via JOIN ... ON
, not the old fashioned FROM a,b WHERE a.x = b.y
Upvotes: 0
Reputation: 108430
The "best order" of the columns in an index will depend on the actual query (queries) that are accessing the table.
If the only predicate in the queries we run are WHERE data.key = 'foo'
, then the most benefit would be with an index that has key
as the leading column.
If we don't ever have any queries that have a predicate on the key
column (in a WHERE clause or an ON clause), and no queries do a GROUP BY data.key, ...
, then we probably don't need to have key
as the leading column in any index.
Sometimes we add multiple indexes, with different sets and orders of columns, in order to satisfy a variety of queries.
The choice of order of columns in an index depends not only on the datatypes and value distribution, but also on the performance requirements for the actual queries that we are planning to run.
EDIT
The query added to the question is of the form:
SELECT f.*
, d.*
, k.*
FROM files f
JOIN data d
ON d.file_id = f.id
JOIN keyval k
ON k.data_id = d.id
AND k.key = 'revenue'
AND k.val = 9.99
WHERE f.id = 999
Assuming
id
is the PRIMARY KEY (or a UNIQUE KEY) in files
, and
id
is the PRIMARY KEY (or a UNIQUE KEY) in data
I'd create indexes
ON data (file_id,id)
ON keyval (data_id,key,val)
I'd look at the EXPLAIN output for the query to see the query plan, and which indexes are being used.
If we didn't have the equality prediate on file.id
, and if key='revenue' and val=9.99
is very selective,
then we would probably want keyval
as the driving table. And we would make sure an index avaiable that has
key
and val
as the leading columns, e.g.
ON keyval(key,val,data_id)
We would want that index if the query was equivalent to:
SELECT f.*
, d.*
, k.*
FROM keyval k
JOIN data
ON d.id = k.data_id
JOIN files f
ON f.id = d.file_id
-- AND f.id = 999
WHERE k.key = 'revenue'
AND k.val = 9.99
Also, do we really need to pull all of the columns from all three tables?
Or are we just using *
as a shortcut just to avoid us listing the columns we actually need?
Upvotes: 3