user14696
user14696

Reputation: 677

Regex failing to match number and dash with letter (or space and letter)

In the tester this works ... but not in PostgreSQL.

My data is like this -- usually a series of letters, followed by 2 numbers and a POSSIBLE '-' or 'space' with only ONE letter following. I am trying to isolate the 2 numbers and the Possible '-" or 'space' AND the ONE letter with my regex:

For ex:

AJ 50-R Busboys ## should return 50-R
APPLES 30 F ## should return 30 F 
FOOBAR 30 Apple ## should return 30

Regex's (that have worked in the tester, but not in PostgreSQL) that I've tried:

substring(REF from '([0-9]+)-?([:space:])?([A-Za-z])?') 

&

substring(REF from '([0-9]+)-?([A-Za-z])?') 

So far everything tests out in the tester...but not the PostgreSQL. I just keep getting the numbers returns -- AND NOTHING AFTER IT.

What I am getting now(for ex):

AJ 50-R Busboys ## returns as "50" NOT as "50-R"

Upvotes: 2

Views: 3440

Answers (3)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656942

This matches your description and examples.
Your description is slightly ambiguous. Leading letters are followed by a space and then two digits in your examples, as opposed to your description.

SELECT t, substring(t, '^[[:alpha:] ]+(\d\d(:?[\s-]?[[:alpha:]]\M)?)')
FROM  (
   VALUES
     ('AJ 50-R Busboys')       -- should return: 50-R
    ,('APPLES 30 F')           -- should return: 30 F 
    ,('FOOBAR 30 Apple')       -- should return: 30
    ,('FOOBAR 30x Apple')      -- should return: 30x
    ,('sadfgag30 D 66 X foo')  -- should return: 30 D - not: 66 X
   ) r(t);

->SQLfiddle

Explanation

^ .. start of string (last row could fail without anchoring to start and global flag 'g'). Also: faster.
[[:alpha:] ]+ .. one or more letters or spaces (like in your examples).
( .. capturing parenthesis
\d\d .. two digits
(:? .. non-capturing parenthesis
[\s-]? .. '-' or 'white space' (character class), 0 or 1 times
[[:alpha:]] .. 1 letter
\M .. followed by end of word (can be end of string, too)
)? .. the pattern in non-capturing parentheses 0 or 1 times

Letters as defined by the character class alpha according to the current locale! The poor man's substitute [a-zA-Z] only works for basic ASCII letters and fails for anything more. Consider this simple demo:

SELECT  substring('oö','[[:alpha:]]*')
       ,substring('oö','[a-zA-Z]*');

More about character classes in Postgres regular expressions in the manual.

Upvotes: 2

Matthew Plourde
Matthew Plourde

Reputation: 44614

Your looking for: substring(REF from '([0-9]+(-| )([A-Za-z]\y)?)')

In SQLFiddle. Your primary problem is that substring returns the first or outermost matching group (ie., pattern surrounded with ()), which is why you get 50 for your '50-R'. If you were to surround the entire pattern with (), this would give you '50-R'. However, the pattern you have fails to return what you want on the other strings, even after accounting for this issue, so I had to modify the entire regex.

Upvotes: 2

Cristian Lupascu
Cristian Lupascu

Reputation: 40546

It's because of the parentheses.

I've looked everywhere in the documentation and found an interesting sentence on this page:

[...] if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.

I took your first expression:

([0-9]+)-?([:space:])?([A-Za-z])?

and wrapped it in parentheses:

(([0-9]+)-?([:space:])?([A-Za-z])?)

and it works fine (see SQLFiddle).

Update:

Also, because you're looking for - or space, you could rewrite your middle expression to [-|\s]? (thanks Matthew for pointing that out), which leads to the following possible REGEX:

(([0-9]+)[-|\s]?([A-Za-z])?)

(SQLFiddle)

Update 2:

While my answer provides the explanation as to why the result represented a partial match of your expression, the expression I presented above fails your third test case.

You should use the regex provided by Matthew in his answer.

Upvotes: 1

Related Questions