pmiranda
pmiranda

Reputation: 8470

Filter strings with regex before casting to numeric

I have this code (was already there, isn't mine):

SELECT
    a.id_original_contrato AS contrato,
    ( CASE WHEN d.value~'^\\d+$' THEN d.value::integer ELSE 0 END ) AS monto,
    EXTRACT(YEAR FROM b.value)::integer AS anoinicio,
    EXTRACT(YEAR FROM c.value)::integer AS anofin

... etc (some JOIN's and WHERE's)

Let me explain: d.value comes from a table where value is character varying (200). The code will insert later the d.value (now called 'monto') in another table as a integer. Someone coded that regex in order to extract some chars or in other case (ELSE), define it as 0. Those values works when they are integer only. If I use a d.value like 76.44 it doesn't work due that regex, it always define it as 0.

Well, I have to change that code, because:

I'm not sure what that regex is doing. And how could I accomplish what need with a better or new regex?

Upvotes: 2

Views: 4152

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 659367

The double backslash in \\d suggests an old version with standard_conforming_strings = off. The manual:

Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off).

In modern versions with standard_conforming_strings = on, this string makes little sense as regular expression: '^\\d+$'. To detect strings consisting of one or more digits use either E'^\\d+$' (prefixed with E) or '^\d+$'. Details:

Integer literals also allow an optional leading sign for negative / positive numbers, and leading / dangling white space in Postgres.
So, this is the complete regular expression for valid integer literals:

CASE WHEN d.value ~ '^\s*[-+]?\d+\s*$' THEN d.value::int ELSE 0 END

The regular expression explained:

^ .. start of string
\s .. class shorthand for [[:space:]] (white space)
* .. quantifier for 0 or more times
[+-] .. character class consisting of + and -
? .. quantifier for 0 or 1 times
\d .. class shorthand for [[:digit:]] (digits)
+ .. quantifier for 1 or more times
\s* .. same as above
$ .. end of string

Consider the syntax rules for numeric string literals. One essential quote:

There cannot be any spaces or other characters embedded in the constant

That's because a numeric constant is not quoted, hence white space is not possible. Not applicable for casting strings. White space is tolerated:

Leading, trailing and right after the exponent char.

So these are all legal strings for the cast to numeric:

'^\s*[-+]?\d*\.?\d+(?:[eE]\s*[-+]?\d+)?\s*$'

The only new element are parentheses (()) to denote the contained regular expression as atom. Since we are not interested in back references, use "non-capturing": (?:...) and append a question mark (?:[eE]\s*[-+]?\d+)? to mean: the "exponential" part can be added or not, as a whole.

Assuming dot (.) as decimal separator. You might use comma instead (,) or [,\.] to allow either. But only dot is legal for the cast.

Test:

SELECT '|' || lit || '|' AS text_with_delim
     , lit ~ '^\s*[-+]?\d*\.?\d+([eE]\s*[-+]?\d+)?\s*$' AS valid
     , lit::numeric AS number
FROM   unnest ('{1
               , 123
               , 000
               , "  -1     "
               , +2
               , 1.2
               , .34
               , 5e6
               , " .5e   -6  "
                }'::text[]) lit;

Result:

text_with_delim valid number
|1| t 1
|123| t 123
|000| t 0
| -1 | t -1
|+2| t 2
|1.2| t 1.2
|.34| t 0.34
|5e6| t 5000000
| .5e -6 | t 0.0000005

Or you might have use for to_number() to convert strings of arbitrary given format.

Upvotes: 5

klin
klin

Reputation: 121919

Choose a variant:

with v(value) as (
    values
    ('12,3'),
    ('12.3'),
    ('123'),
    ('123.'),
    ('.123'),
    ('1.2.3')
    )

select 
    value, 
    value ~ '^(\d+[,\.]\d+|\d+)$' as variant_a,
    value ~ '^(\d*[,\.]\d*|\d+)$' as variant_b,
    value ~ '^\d+[,\.]\d+$' as variant_c
from v;

 value | variant_a | variant_b | variant_c 
-------+-----------+-----------+-----------
 12,3  | t         | t         | t
 12.3  | t         | t         | t
 123   | t         | t         | f
 123.  | f         | t         | f
 .123  | f         | t         | f
 1.2.3 | f         | f         | f
(6 rows)

To convert a string with a dot or a comma to numeric use replace():

select replace(value, ',', '.')::numeric;   

Upvotes: 2

Related Questions