Markus
Markus

Reputation: 1292

RegEx in SQL Query

I have a SQL Query containing a

REGEXP_REPLACE(LISTAGG(foo.name, ',') WITHIN GROUP (ORDER BY foo.name), '([^,]+)(,\1)+', '\1') AS bar

in its SELECT. LISTAGG concatenates the values of the foo.name column with a ',' as separator, whereas REGEXP_REPLACE replaces duplicates.

When I change the LISTAGG so that a ", " (comma followed by a whitespace character) is used as separator, how do I have to adjust the REGEXP_REPLACE?

Edit:

When changing the REGEXP_REPLACE to '([^,]+)(, \1)+' it seems to work and I get

CITRONENSÄURE, KALIUMSORBAT, PEKTIN

But when changing the REGEXP_REPLACE to '^([^,]+)(, \1)+$' I still have duplicates:

CITRONENSÄURE, CITRONENSÄURE, CITRONENSÄURE, CITRONENSÄURE, CITRONENSÄURE, KALIUMSORBAT, KALIUMSORBAT, KALIUMSORBAT, KALIUMSORBAT, KALIUMSORBAT, PEKTIN, PEKTIN, PEKTIN, PEKTIN, PEKTIN

So, is '([^,]+)(, \1)+' the right one, regardless of having no ^ and $?

Upvotes: 0

Views: 703

Answers (1)

Mariano
Mariano

Reputation: 6521

Exactly what you did in LISTAGG, add a space after the comma

'([^,]+)(, \1)+'

Though you should make sure it matches the whole string, anchoring at the beggining of the string (^) and at the end ($).

'^([^,]+)(, \1)+$'


Edit:
As for the regex syntax, to answer the edit in your question, if you want to remove duplicates within the same line, you have to make sure it's matching the entire value. For example, '([^,]+)(, \1)+' will match in 'fo[o], [o]ther'. So we'll match the trailing comma as well (or end of string).

([^,]+), (\1(, |$))+

This expression is much safer than the one you were trying to use. However, there are a few cases where it could fail.

SQL:

select regexp_replace(
    listagg("name", ', ') within group (order by "name")
    ,'([^,]+), (\1(, |$))+', '\1\3')
as "bar"
from foo;

DEMO in SQLfiddle



When could it fail?
The expression is not 100% safe because the first word is not anchored, and thus it can match at the middle of a value. For example, it will remove WORD from the list:

'AWORD, WORD, XXX'  ==>  'AWORD, XXX'
  ====--^^^^--

I don't think there's a way to avoid this, since Oracle implements POSIX EREs and does not support lookarrounds, word boundaries nor \G assertions. Moreover, this is not the proper way to remove duplicate values. With really long tables, you'll end up in ORA-01489: result of string concatenation is too long.

I'd recommend using GROUP BY or DISTINCT for that case. You can read about it in this article: SQL/mysql - Select distinct/UNIQUE but return all columns?.

Upvotes: 1

Related Questions