Vérace
Vérace

Reputation: 908

Extract text up to the n-th character in a string, but return the whole string if the character isn't present

I was looking at this question and the accepted answer gives this as a solution for the case when there are fewer than n characters in the string:

^(([^>]*>){4}|.*)

However, I have done a fiddle here, and it shows that this regex will just simply return the entire string all of the time.

This code:

SELECT
  SUBSTRING(a FROM '^(([^>]*>){4}|.*)'),
  a,
  LENGTH(SUBSTRING(a FROM '^(([^>]*>){4}|.*)')),
  LENGTH(a),

  LENGTH(SUBSTRING(a FROM '^(([^>]*>){4}|.*)')) = LENGTH(a)
  
FROM s
WHERE   LENGTH(SUBSTRING(a FROM '^(([^>]*>){4}|.*)')) = LENGTH(a) IS false;

after several runs returns no records - meaning that the regex is doing nothing.

Question:

I would like a regex which returns up to the fourth > character (not including it) OR the entire string if the string only contains 3 or fewer > characters. RTRIM() can always be used to trim the final > if not including it is too tricky - having an answer which gives both possibilities would help me to deepen my understanding of regexes!

This is not a duplicate - it's certainly related, but I'd like to correct the error in the original answer - and provide a correct answer of my own.

Upvotes: 1

Views: 315

Answers (2)

The fourth bird
The fourth bird

Reputation: 163207

You can repeat matching 0-3 times including the > using

^(?:[^>]*>){0,3}[^>]*
  • ^ Start of string
  • (?:[^>]*>){0,3} Repeat 0 - 3 times matching any character except > and then match >
  • [^>]* Optionally match any char except >

See a regex demo.

If there should be at least a single > then the quantifier can be {1,3}

Upvotes: 1

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626728

You can use

REGEXP_REPLACE(a, '^((?:[^>]*>){4}).*', '\1') 

See the regex demo. Details:

  • ^ - start of string
  • ((?:[^>]*>){4}) - Group 1 (\1): four sequences of any chars other than > and then a > char
  • .* - the rest of the line.

Here is a test:

CREATE TABLE s
(
  a TEXT
);
INSERT INTO s VALUES 
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>');
SELECT REGEXP_REPLACE(a, '^((?:[^>]*>){4}).*', '\1') as Output FROM s;

Output:

enter image description here

Upvotes: 1

Related Questions