KMK

Reputation: 1509

SQL performance for string field vs multiple int/varchar fields

I'm trying to get the database design right, and I'm uncertain which of two options I should choose. There will be around 20 bool values used for filtering (4 in the samples below for simplicity).

Option 1)

An int(1) or varchar(1) field for each prop. The filter part of the query could look like:

WHERE prop1=1 AND prop3=1 AND prop4=1
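
For concreteness, a minimal sketch of option 1 (table and column names are hypothetical):

CREATE TABLE items (
    id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    prop1 INT(1) NOT NULL DEFAULT 0,
    prop2 INT(1) NOT NULL DEFAULT 0,
    prop3 INT(1) NOT NULL DEFAULT 0,
    prop4 INT(1) NOT NULL DEFAULT 0
);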

Option 2)

Using a single text field with characters representing the props

The filter part of the query equivalent to option 1 would then be:

WHERE props LIKE '%a%' AND props LIKE '%c%' AND props LIKE '%d%'

or if the characters are sorted:

WHERE props LIKE '%a%' AND props LIKE '%cd%'
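
And a minimal sketch of option 2 on the same hypothetical table:

CREATE TABLE items (
    id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    props VARCHAR(20) NOT NULL DEFAULT ''  -- e.g. 'acd' when props a, c and d apply
);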

My thought is that option 2 makes it easier to add new props, so I like this option, but will the LIKE comparator perform worse than the equality comparator? Does it make any difference to use a single text field compared to multiple int(1) or varchar(1) fields? Any other benefits or drawbacks that I haven't thought of?

Upvotes: 0

Views: 951

Answers (2)

Rick James

Reputation: 142268

The primary question is whether you can run faster than scanning the entire table. The answer is "no" unless a small number of the booleans can be handled separately with Index(es).

Your WHERE bools LIKE '%a%c%d%' is a clever trick for ANDing any number of flags together. However, it will need to look at every row, and LIKE is slightly heavyweight.

INT(1) takes 4 bytes plus overhead; the (1) is only a display width, not a storage size. TINYINT is what you are fishing for; it takes 1 byte, plus overhead.
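
For example, switching such a column over (column name hypothetical):

ALTER TABLE items MODIFY prop1 TINYINT UNSIGNED NOT NULL DEFAULT 0;  -- 1 byte instead of 4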

A SET with up to 64 bools is another technique. The coding is a bit clumsy, but it is rather efficient.
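
A sketch of the SET approach, with hypothetical names; FIND_IN_SET returns the 1-based position of an element in the set, or 0 if it is absent:

CREATE TABLE items (
    id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    props SET('a','b','c','d') NOT NULL DEFAULT ''
);

-- rows having a, c and d all set
SELECT id FROM items
WHERE FIND_IN_SET('a', props)
  AND FIND_IN_SET('c', props)
  AND FIND_IN_SET('d', props);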

An INT UNSIGNED (for up to 32 flags) or BIGINT UNSIGNED (for up to 64) is implemented similarly to SET and takes 4 or 8 bytes. But the coding is rather clumsy. Let's number the bits starting with 0 at the least significant bit.

WHERE (bools & ( (1 << 0) | (1 << 2) | (1 << 3) ) ) = 
               ( (1 << 0) | (1 << 2) | (1 << 3) )

would check that bits 0, 2, and 3 are all set. (This is like your test for a, c, d.) A variety of ANDs and ORs are possible with this approach. (You could pre-compute those bit values: 13 in this example. Or use a bit literal: 0b1101.)
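
Hedged examples of the day-to-day operations under this scheme (table and column names are hypothetical):

-- set bit 2 on one row
UPDATE items SET bools = bools | (1 << 2) WHERE id = 42;

-- clear bit 2
UPDATE items SET bools = bools & ~(1 << 2) WHERE id = 42;

-- rows with bits 0, 2 and 3 all set (mask 13 = 0b1101)
SELECT id FROM items WHERE (bools & 13) = 13;

-- rows with at least one of those bits set
SELECT id FROM items WHERE (bools & 13) <> 0;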

The benefit of SET or bits in an INT is the 'speed' within each row. Still, all rows must be tested.

So, I recommend triaging your bools and other columns: decide what needs indexing and what can go into this combined column, or into a combined JSON column for the non-bools.
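
One possible shape of that triage, with all names hypothetical (the JSON type requires MySQL 5.7+): keep one selective flag in its own indexed column, pack the rest into a bit column, and put non-bools in JSON:

CREATE TABLE items (
    id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    is_active TINYINT NOT NULL,              -- selective enough to deserve an index
    bools     BIGINT UNSIGNED NOT NULL DEFAULT 0,
    extra     JSON NULL,
    KEY idx_active (is_active)
);

SELECT id FROM items
WHERE is_active = 1          -- the index narrows the scan
  AND (bools & 13) = 13;     -- remaining flags tested per surviving row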

Upvotes: 1

Peter Koltai

Reputation: 9734

From a performance point of view, both options have drawbacks:

  1. It is almost impossible to have proper indexing for 20 columns, regardless of their data types, because queries can filter on prop1, on prop1 + prop2, on prop2 + prop3, on prop2 + prop4, etc., so you would need a lot of composite indexes. In addition, indexes on boolean values usually don't perform well because of their low cardinality (the value can only be 0 or 1; see here for example).

  2. On the other hand, LIKE conditions with a leading % are also a performance issue. A pattern with only a trailing % can use an index, but a leading % cannot, so it is bound to be slow.

What I see here is that you want to assign a set of attributes to your records: each record can have 0..n props assigned, and you want to filter on them efficiently, much like a user can have 0, 1, 2, ... n roles assigned. In relational databases this is a classic many-to-many relationship. If you set up a table with the possible props and connect it to your records with a join table, as detailed here for example, you can write good queries with only a limited number of indexes.
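
A sketch of that design, with hypothetical table names; the HAVING clause keeps only the records that match all three requested props:

CREATE TABLE props (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    code CHAR(1) NOT NULL UNIQUE
);

CREATE TABLE record_props (
    record_id INT UNSIGNED NOT NULL,
    prop_id   INT UNSIGNED NOT NULL,
    PRIMARY KEY (record_id, prop_id),
    KEY idx_prop (prop_id, record_id)
);

-- records that have all of props 'a', 'c' and 'd'
SELECT rp.record_id
FROM record_props rp
JOIN props p ON p.id = rp.prop_id
WHERE p.code IN ('a', 'c', 'd')
GROUP BY rp.record_id
HAVING COUNT(*) = 3;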

Upvotes: 0
