Reputation: 19
I am trying to filter out data that is stored as strings but represents numeric values. I filter out non-numeric strings then convert to a float. I am then trying to filter values above or below a specific amount. Below is a simplified version of my query that works with the filters disabled
WITH attributes AS (
SELECT
property_id,
CASE WHEN regexp_replace(value, '[^0-9.]', '')='' THEN
null
WHEN regexp_replace(value, '[^0-9.]', '') = '.' THEN NULL
ELSE
regexp_replace(value, '[^0-9.]', '')::float
END AS sqft, 'platform' FROM mv_prop_attributes
WHERE
display_name = 'Livable Area'
and sqft is not null
and length(sqft)>0
)
SELECT
*
FROM
attributes
WHERE true
-- and sqft < 100000 --works
-- and sqft>200 --works
-- and sqft < 100000 and sqft>200 --does not work
-- and sqft between 200 and 100000 --does not work
This is running on an up to date redshift cluster. Thank you for any suggestions on how to resolve this at the DB level.
Upvotes: 0
Views: 89
Reputation: 19
Adjustments to regex failed to work. After making additional adjustments to the query I found a solution:
WITH attributes AS (
SELECT
property_id,
CASE WHEN regexp_replace(value, '[^0-9.]', '')='' THEN
null
WHEN regexp_replace(value, '[^0-9.]', '') = '.' THEN NULL
when display_name = 'Livable Area' then
regexp_replace(value, '[^0-9.]', '')::float
END AS sqft, 'platform' FROM mv_prop_attributes
WHERE
display_name = 'Livable Area'
and sqft is not null
and length(sqft)>0
)
SELECT
*
FROM
attributes
WHERE true
-- and sqft < 100000 --works
-- and sqft>200 --works
and sqft < 100000 and sqft>200
-- and sqft between 200 and 100000 --does not work
Based on this working I can only come to the conclusion redshift is evaluating the column values for all the data then afterwards implementing the where clause to filter results. This results in bad rows being passed to the function even though the where clause should have filtered them out. By reimplementing the where clause in the field definition case statement it filtered bad data out at 2 points. This explains why when removing all the regex it looked fine with the where clause enabled.
Upvotes: 0
Reputation: 11082
I think the problem is likely that you didn't account for a period (vs. a decimal point). The string "one thousand square feet." would come through your SQL as a single '.' which cannot be converted to a float.
I suspect that you aren't getting the error with inequality as Redshift can push down WHERE clauses where it can (but this case seems extreme). Inequality can be applied to string - '1' < '2' - but this doesn't work the same for ranges. This could be letting the query run but is a long shot.
==================================================================
A way to test if the source data has extra '.'s in it would be to run:
SELECT count(1)
FROM mv_prop_attributes
WHERE value like '%.%.%';
Upvotes: 0
Reputation: 5926
Try regexp_replace(value, '[^0-9.]', '',1,0,'g') - replace all non numeric symbols and '.', not only first.
WITH attributes AS (
SELECT
property_id,
CASE WHEN regexp_replace(value, '[^0-9.]', '',1,0,'g')='' THEN
null
WHEN regexp_replace(value, '[^0-9.]', '',1,0,'g') = '.' THEN NULL
ELSE
regexp_replace(value, '[^0-9.]', '',1,0,'g')::float
END AS sqft, 'platform' FROM mv_prop_attributes
WHERE
display_name = 'Livable Area'
and sqft is not null
and length(sqft)>0
)
SELECT
*
FROM
attributes
WHERE true
-- and sqft < 100000 --works
-- and sqft>200 --works
and sqft < 100000 and sqft>200 --does not work
and sqft between 200 and 100000 --does not work
Upvotes: 0