Optimal use of LIKE on indexed column

Question

I have a large table (+- 1 million rows, 7 columns including the primary key). The table contains two columns (ie: symbol_01 and symbol_02) that are indexed and used for querying. This table contains rows such as:

id    symbol_01    symbol_02   value_01  value_02
1     aaa          bbb         12        15
2     bbb          aaa         12        15
3     ccc          ddd         20        50
4     ddd          ccc         20        50

As per the example rows 1 and 2 are identical except that symbol_01 and symbol_02 are swapped but they have the same values for value_01 and value_02. That is true once again with row 3 and 4. This is the case for the entire table, there are essentially two rows for each combination of symbol_01+symbol_02.

I need to figure out a better way of handling this to get rid of the duplication. So far the solution I am considering is to just have one column called symbol which would be a combination of the two symbols, so the table would be as follows:

id    symbol       value_01   value_02
1     ,aaa,bbb,    12         15
2     ,ccc,ddd,    20         50

This would cut the number of rows in half. As a side note, every value in the symbol column will be unique. Results always need to be queried for using both symbols, so I would do:

select value_01, value_02
from my_table
where symbol like '%,aaa,%' and symbol like '%,bbb,%'

This would work but my question is around performance. This is still going to be a big table (and will get bigger soon). So my question is, is this the best solution for this scenario given that symbol will be indexed, every symbol combination will be unique, and I will need to use LIKE to query results.

Is there a better way to do this? Im not sure how great LIKE is for performance but I don't see an alternative?

Clodoaldo Neto · Accepted Answer

With this index:

create index symbol_index on t (
    least(symbol_01, symbol_02),
    greatest(symbol_01, symbol_02)
)

The query would be:

select *
from t
where 
    least(symbol_01, symbol_02) = least('aaa', 'bbb')
    and
    greatest(symbol_01, symbol_02) = greatest('aaa', 'bbb')

Or simply delete the duplicates:

delete from t
using (
    select distinct on (
        greatest(symbol_01, symbol_02),
        least(symbol_01, symbol_02),
        value_01, value_02
    ) id
    from t
    order by
        greatest(symbol_01, symbol_02),
        least(symbol_01, symbol_02),
        value_01, value_02
) s
where id = s.id

Depending on the columns semantics it might be better to normalize the table as suggested by @Bohemian

Optimal use of LIKE on indexed column

Answers (2)

Related Questions