Chris
Chris

Reputation: 14218

mysql best index order for (FK, key, value)

I have a key value table:

id, data_id, key, value

that references a data table:

id, file_id, data

which in turn references a files table:

id, name

files contains about 10000 entries data contains about 100 million entries

The key column has about 100 distinct values The value field has many distinct values (strings)

Now I am wondering which order the three columns data_id, key, value should have in my index.

The queries that I am interested in look for data entries that have certain file_id and specific key value combinations. E.g.:

SELECT * FROM files, data, keyval
WHERE files.id = data.file_id
AND data.id = keyval.data_id
AND files.id = 999
AND keyval.key = 'revenue'
AND keyval.val = 9.99

Upvotes: 4

Views: 381

Answers (2)

Rick James
Rick James

Reputation: 142316

A key-value schema sucks. As your queries get more involved, you will understand why.

One thing you can do to help some is to follow the tips here.

Also, please specify the relationships in the tables via JOIN ... ON, not the old fashioned FROM a,b WHERE a.x = b.y

Upvotes: 0

spencer7593
spencer7593

Reputation: 108430

The "best order" of the columns in an index will depend on the actual query (queries) that are accessing the table.

If the only predicate in the queries we run are WHERE data.key = 'foo', then the most benefit would be with an index that has key as the leading column.

If we don't ever have any queries that have a predicate on the key column (in a WHERE clause or an ON clause), and no queries do a GROUP BY data.key, ..., then we probably don't need to have key as the leading column in any index.

Sometimes we add multiple indexes, with different sets and orders of columns, in order to satisfy a variety of queries.

The choice of order of columns in an index depends not only on the datatypes and value distribution, but also on the performance requirements for the actual queries that we are planning to run.


EDIT

The query added to the question is of the form:

 SELECT f.*
      , d.*
      , k.*
   FROM files f
   JOIN data d
     ON d.file_id = f.id
   JOIN keyval k
     ON k.data_id = d.id
    AND k.key     = 'revenue'
    AND k.val     = 9.99
  WHERE f.id = 999

Assuming
id is the PRIMARY KEY (or a UNIQUE KEY) in files, and
id is the PRIMARY KEY (or a UNIQUE KEY) in data

I'd create indexes

 ON data (file_id,id)
 ON keyval (data_id,key,val)

I'd look at the EXPLAIN output for the query to see the query plan, and which indexes are being used.


If we didn't have the equality prediate on file.id, and if key='revenue' and val=9.99 is very selective, then we would probably want keyval as the driving table. And we would make sure an index avaiable that has key and val as the leading columns, e.g.

ON keyval(key,val,data_id)

We would want that index if the query was equivalent to:

 SELECT f.*
      , d.*
      , k.*
   FROM keyval k
   JOIN data
     ON d.id  = k.data_id
   JOIN files f
     ON f.id  = d.file_id
 -- AND f.id  = 999
  WHERE k.key = 'revenue'
    AND k.val = 9.99

Also, do we really need to pull all of the columns from all three tables? Or are we just using * as a shortcut just to avoid us listing the columns we actually need?

Upvotes: 3

Related Questions