darthtye
darthtye

Reputation: 19

Redshift/Postgres between function produces seemingly unrelated error when ">" or "<" both work but not together

I am trying to filter out data that is stored as strings but represents numeric values. I filter out non-numeric strings then convert to a float. I am then trying to filter values above or below a specific amount. Below is a simplified version of my query that works with the filters disabled

WITH attributes AS (

    SELECT
        property_id,
        CASE WHEN regexp_replace(value, '[^0-9.]', '')='' THEN
            null
        WHEN regexp_replace(value, '[^0-9.]', '') = '.' THEN NULL
        ELSE
            regexp_replace(value, '[^0-9.]', '')::float
        END AS sqft, 'platform' FROM mv_prop_attributes
        WHERE
            display_name = 'Livable Area'
            and sqft is not null
            and length(sqft)>0
)

    SELECT
            *
        FROM
            attributes
        WHERE true
--          and sqft < 100000 --works
--          and sqft>200 --works
--          and sqft < 100000 and sqft>200 --does not work
--          and sqft between 200 and 100000 --does not work

As noted in the commented portions I can filter above or below a value but not a range. Attempts to do so produce the following error: ERROR: Invalid digit, Value '.', Pos 0, Type: Double DETAIL:

error: Invalid digit, Value '.', Pos 0, Type: Double code: 1207 context: . query: 438646641 location: :0 process: query0_117_438646641 [pid=0]

This is running on an up to date redshift cluster. Thank you for any suggestions on how to resolve this at the DB level.

Upvotes: 0

Views: 89

Answers (3)

darthtye
darthtye

Reputation: 19

Adjustments to regex failed to work. After making additional adjustments to the query I found a solution:

WITH attributes AS (

       SELECT
        property_id,
        CASE WHEN regexp_replace(value, '[^0-9.]', '')='' THEN
            null
        WHEN regexp_replace(value, '[^0-9.]', '') = '.' THEN NULL
        when display_name = 'Livable Area' then
            regexp_replace(value, '[^0-9.]', '')::float
        END AS sqft, 'platform' FROM mv_prop_attributes
        WHERE
            display_name = 'Livable Area'
            and sqft is not null
            and length(sqft)>0
)

    SELECT
            *
        FROM
            attributes
        WHERE true
        --          and sqft < 100000 --works
--          and sqft>200 --works
          and sqft < 100000 and sqft>200
--          and sqft between 200 and 100000 --does not work

Based on this working I can only come to the conclusion redshift is evaluating the column values for all the data then afterwards implementing the where clause to filter results. This results in bad rows being passed to the function even though the where clause should have filtered them out. By reimplementing the where clause in the field definition case statement it filtered bad data out at 2 points. This explains why when removing all the regex it looked fine with the where clause enabled.

Upvotes: 0

Bill Weiner
Bill Weiner

Reputation: 11082

I think the problem is likely that you didn't account for a period (vs. a decimal point). The string "one thousand square feet." would come through your SQL as a single '.' which cannot be converted to a float.

I suspect that you aren't getting the error with inequality as Redshift can push down WHERE clauses where it can (but this case seems extreme). Inequality can be applied to string - '1' < '2' - but this doesn't work the same for ranges. This could be letting the query run but is a long shot.

==================================================================

A way to test if the source data has extra '.'s in it would be to run:

SELECT count(1) 
FROM mv_prop_attributes
WHERE value like '%.%.%';

Upvotes: 0

ValNik
ValNik

Reputation: 5926

Try regexp_replace(value, '[^0-9.]', '',1,0,'g') - replace all non numeric symbols and '.', not only first.

WITH attributes AS (

    SELECT
        property_id,
        CASE WHEN regexp_replace(value, '[^0-9.]', '',1,0,'g')='' THEN
            null
        WHEN regexp_replace(value, '[^0-9.]', '',1,0,'g') = '.' THEN NULL
        ELSE
            regexp_replace(value, '[^0-9.]', '',1,0,'g')::float
        END AS sqft, 'platform' FROM mv_prop_attributes
        WHERE
            display_name = 'Livable Area'
            and sqft is not null
            and length(sqft)>0
)

    SELECT
            *
        FROM
            attributes
        WHERE true
--          and sqft < 100000 --works
--          and sqft>200 --works
          and sqft < 100000 and sqft>200 --does not work
          and sqft between 200 and 100000 --does not work

Upvotes: 0

Related Questions