oligofren
oligofren

Reputation: 22923

Matching a substring against a stored list of strings

So, after the recent Twitch leak everyone is discussing their favorite bits of bad code. One bit that stood out was a single monster SQL statement to spot "illegal terms". That got me thinking about how to implement this "correctly".

So you have a table of strings that you want to match sub-strings against, how you write this in PL/pgSQL? I am assuming any SQL implementation should have procedural capabilities to create functions/procedures for such meta-programming, basically creating and executing SQL like the following:

admin@localhost:words> SELECT 'XabcY' LIKE '%abc%' OR 'XabcY' LIKE '%xyz%' as matches;
+-----------+
| matches   |
|-----------|
| True      |
+-----------+

So to be more specific, given a list of strings in table disallowed:

| illegal_string |
|----------------|
| stupid         |
| witless        |
| moron          |
| commie-lover   |

How would you create a dynamic query in PL/pgSQL that when executed returned true if any of these matched a given string? It does not need to use ILIKE in the Twitch like to check if the given word contained the substring, so using position is fine too, but it should be performant/tuneable using gin indices and whatnot.

Related issues.

Upvotes: 2

Views: 1754

Answers (2)

Stefanov.sm
Stefanov.sm

Reputation: 13049

I do not think that you need PL/pgSQL or dynamic SQL for this. Make a set of text from the table of disallowed words and match against the set items as regular expressions. Maybe not very performant using regexp magic, but I hope straightforward. The query below can of course be parameterized.

select '<text to examine>' ~* 
  any(select illegal_string from disallowed) as rude;
select 'You Stupido MORONE!' ~* 
  any(select illegal_string from disallowed) as rude;
-- yields true.

You may wish to restrict the search to whole words only. Then play with the regexp set and shape it like this:

select 'You Stupido MORONE!' ~* 
  any(select '\m'||illegal_string||'\M' from disallowed) as rude;
-- yields false

SQL Fiddle here

Upvotes: 1

Pavel Stehule
Pavel Stehule

Reputation: 45795

There are two possibilities:

  1. you can use LIKE ANY(array) operator
postgres=# select 'Ahoj' like any (ARRAY['Ah%', 'Na%']);
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

postgres=# select 'Nazdar' like any (ARRAY['Ah%', 'Na%']);
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

  1. you can use regular expression:
postgres=# select 'Ahoj' ~ '^(Ah|Na)';
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

postgres=# select 'Nazdar' ~ '^(Ah|Na)';
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

So in the end you don't need dynamic SQL.

There are ANSI/SQL syntax too:

postgres=# select 'Nazdar' similar to '(Ah|Na)%';
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)

So you can write something like:

DECLARE pw text[];
BEGIN
  pw := (SELECT array_agg('%' || disallowed || '%'
            FROM disallowed);
  IF EXISTS(SELECT * FROM foo WHERE c LIKE ANY (pw)) THEN
    RAISE NOTICE 'there are some disallowed words';
  END IF;
  ...

I am not sure about an performance. On larger table you need trigram index, or maybe better is using fulltext instead substring searching.

Upvotes: 1

Related Questions