Reputation: 11
I have a table with 3 columns:
item_id (decimal), key_name (varchar), key_string_value (varchar)
It is populated with hundreds and thousands of rows. Below are the first 6 rows to give you an idea of the data.
1. 1 product product1
2. 1 topic topic1
3. 1 segment segment1
4. 2 product product2
5. 2 topic topic1
6. 2 segment segment1
Each distinct item_id
has 3 rows associated with it and describing it with metadata like topic, product, segment. I am trying to write a SQL query which does the following and performs extremely well.
Now I need to select all item_ids which will suffice a certain combination of these metadata for example:
item_id
's for which topic='topic1'
and product='product2'
and segment='Segment1'
I am trying both self joins and correlated queries but am not sure which direction should I continue for the best performance. Any help would be appreciated.
Upvotes: 1
Views: 133
Reputation: 65304
First of all: I don't think this has a well-performing solution without an index on item_id
, so I think you should change your PK to a composite of (item_id, key_name)
- you need this to be unique anyway.
Assuming you have done so, we now have a fast track to both item_id
and key_name
, which I consider to be a necessity.
We can now try a self join, which should give good results on MySQL, as in InnoDB all composite indices are clustered:
SELECT
one.item_id
FROM table_name AS one
INNER JOIN table_name AS two
ON two.item_id=one.item_id
INNER JOIN table_name AS three
ON three.item_id=one.item_id
WHERE one.key_name='product'
AND one.key_string_value='product1'
AND two.key_name='topic'
AND two.key_string_value='topic1'
AND three.key_name='segment'
AND three.key_string_value='segment1'
;
It is important to apply the selector with the highest selectivity to the driving table. i.e. one
- in my query I have assumed, that product
has a higher selectivity than topic
or segment
.
Upvotes: 0
Reputation: 181008
Correlated subqueries are usually about the least performant solution to any problem unless the DB can transform them into joins, as some query planners can do in some cases. In any optimization process, though, it is wise to check and compare the plans the DB actually comes up with for different queries and different conditions, using the DB's mechanism for that (e.g. EXPLAIN PLAN
in some databases).
My first pass at this task would take this form:
SELECT product.item_id
FROM
(
SELECT item_id
FROM my_table
WHERE key_name = 'product' AND key_string_value = 'product2'
) product
JOIN (
SELECT item_id
FROM my_table
WHERE key_name = 'topic' AND key_string_value = 'topic1'
) topic
ON product.item_id = topic.item_id
JOIN (
SELECT item_id
FROM my_table
WHERE key_name = 'segment' AND key_string_value = 'segment1'
) segment
ON topic.item_id = segment.item_id
That assumes the table has a primary key constraint -- or at least a uniqueness constraint -- on (item_id, key_name)
; if it doesn't then the inline views should use SELECT DISTINCT
. Additionally, it may benefit significantly from an index on (key_name)
, or even more from an index on (key_name, key_string_value)
.
Before you go crazy creating indexes, though, be sure you understand that though they may speed queries, they do slow inserts, updates, and deletes, and they increase the footprint of your database. And again, check the query plans, and test the performance.
Update:
Given that we now find there is indeed an index on (key_name, key_string_value)
, and that the wide variety of key_string_value
s makes such an index highly selective, I'm inclined to think that the above approach will do quite well. When you test, don't forget to SELECT DISTINCT
if you cannot otherwise rely on the inline views to avoid providing duplicate item_id
s.
Upvotes: 1
Reputation: 34784
You can use HAVING
and conditional aggregation for this:
SELECT item_id
FROM YourTable
GROUP BY item_id
HAVING MAX(CASE WHEN key_name = 'topic' AND key_string_value ='topic1' THEN 1 END) = 1
AND MAX(CASE WHEN key_name = 'product' AND key_string_value ='product2' THEN 1 END) = 1
AND MAX(CASE WHEN key_name = 'segment' AND key_string_value ='segment1' THEN 1 END) = 1
Is it the most performant? Will have to test, but I'd expect it to beat self-joins and correlated subqueries.
Upvotes: 1