I want to check if every element in a set contained in any part of rows in a column. For example, Set A = {5013, aab, 402dha) Column B adaf**5013**dad344 23**aab**yyyy zzz**402dha**vuuuda ..... ... I cannot use "where column B in A" because there are leading and trailing characters; I cannot use TRIM, SUBSTRING because the elements I want to search can be in any part of the column row; I can write 3 regular expressions in where clause when A has only 3 elements, but I cannot do that when A has ~1000 elements. My question is if there is a way I can do that? Or is there a function similar to 'VLookup' in Excel? I will appreciate any idea!

Reputation: 179

Can I use regular expression to match a set of values?

I want to check if every element in a set contained in any part of rows in a column. For example,

Set A = {5013, aab, 402dha)

Column B

adaf**5013**dad344  
23**aab**yyyy  
zzz**402dha**vuuuda  
.....  
...

I cannot use "where column B in A" because there are leading and trailing characters;
I cannot use TRIM, SUBSTRING because the elements I want to search can be in any part of the column row;
I can write 3 regular expressions in where clause when A has only 3 elements, but I cannot do that when A has ~1000 elements.

My question is if there is a way I can do that? Or is there a function similar to 'VLookup' in Excel?

I will appreciate any idea!

Upvotes: 1

Answers (1)

Craig Ringer

Reputation: 324701

That's a pretty horrid schema; best results will be achieved by changing it so you store your multiple values in:

Ordinary relational child tables;
Arrays; or
hstore

all of which allow you to use fairly simple and sane SQL expressions to determine what you want, and are indexable (via regular b-tree indexes for child tables, and via GiST or GIN indexes for arrays and hstore) for better performance on large tables.

It's certainly possible as-is, but performance will be miserable. One way is to use regexp_split_to_array to convert the column to an array, then use the array operators to test for overlap.

See this SQLFiddle demo, which uses an expanded test set because yours wasn't really adequate to demonstrate the problem.

I've shown both "any of set appears in column" (%%) and "all of set appears in column" (@>) since it isn't clear from the question which you want.

Setup:

CREATE TABLE test(gah text);

INSERT INTO test(gah) VALUES
('adaf**5013**dad344'),
('23**aab**yyyy'),
('zzz**402dha**vuuuda'),
('no**matches**here**lalala'),
('5013**aab**402dha'),
('402dha**aab**somethingelse**5013'),
('402dha**aab**5013');

Demo:

regress=> SELECT gah FROM test 
          WHERE regexp_split_to_array(gah, '\*\*') && ARRAY['5013', 'aab', '402dha'];
               gah                
----------------------------------
 adaf**5013**dad344
 23**aab**yyyy
 zzz**402dha**vuuuda
 5013**aab**402dha
 402dha**aab**somethingelse**5013
 402dha**aab**5013
(6 rows)

regress=> SELECT gah FROM test 
          WHERE regexp_split_to_array(gah, '\*\*') @> ARRAY['5013', 'aab', '402dha'];
               gah                
----------------------------------
 5013**aab**402dha
 402dha**aab**somethingelse**5013
 402dha**aab**5013
(3 rows)

Amazingly, you can actually create an index that will benefit this query by making use of PostgreSQL's support of expression indexes. Of course, just because you can do it doesn't mean it's a good idea:

regress=> CREATE INDEX test_glah_resplit_gin ON test 
          USING GIN(( regexp_split_to_array(gah, '\*\*') ));
CREATE INDEX
regress=> -- Only for testing purposes, don't use in production:
regress=> SET enable_seqscan = off;
SET
regress=> explain SELECT gah FROM test WHERE regexp_split_to_array(gah, '\*\*') @> ARRAY['5013', 'aab', '402dha'];
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=16.00..20.02 rows=1 width=32)
   Recheck Cond: (regexp_split_to_array(gah, '\*\*'::text) @> '{5013,aab,402dha}'::text[])
   ->  Bitmap Index Scan on test_glah_resplit_gin  (cost=0.00..16.00 rows=1 width=0)
         Index Cond: (regexp_split_to_array(gah, '\*\*'::text) @> '{5013,aab,402dha}'::text[])
(4 rows)

regress=> explain SELECT gah FROM test WHERE regexp_split_to_array(gah, '\*\*') && ARRAY['5013', 'aab', '402dha'];
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=16.00..20.02 rows=1 width=32)
   Recheck Cond: (regexp_split_to_array(gah, '\*\*'::text) && '{5013,aab,402dha}'::text[])
   ->  Bitmap Index Scan on test_glah_resplit_gin  (cost=0.00..16.00 rows=1 width=0)
         Index Cond: (regexp_split_to_array(gah, '\*\*'::text) && '{5013,aab,402dha}'::text[])
(4 rows)

GIN indexes are expensive to update, so you'll pay a significant performance price on insert/update if you use this approach. That's true with ordinary arrays; using regexp_split_to_table to create them on the fly just makes it a little worse. See GIN tips and the intro to GIN indexes.

For example, inserting a million rows into my test table with INSERT INTO test(gah) SELECT 'aaaaabbbbb'||(x::text) FROM generate_series(1,1000000) x; took 22 seconds with the GIN index in place, and 1.6 seconds after dropping it. That's likely to be a particularly bad case because of the uniformity of the values, though.

Upvotes: 5

Can I use regular expression to match a set of values?

Answers (1)

Related Questions