Redshift Python UDF runs on its own but throws an error when part using count or as part of another query

EDIT/UPDATE(BELOW) I stored and can successfully run a python UDF in AWS-Redshift. The UDF is taking lat/lon points and returning a boolean if the point is within a given distance of another given point.

When I run

SELECT dist_in_range(5000.0, latitude, longitude, 38.897957, -77.036560) as in_range 
from test_2;

It returns a column of true or false as expected.

When I run

SELECT a.in_range from (SELECT dist_in_range(5000.0, latitude, longitude, 38.897957, -77.036560) as in_range 
                             from test_2) as a
                             where a.in_range = false;

to filter on false, it again runs correctly.

If I add a count() function into the query like:

SELECT count(a.in_range) from (SELECT dist_in_range(5000.0, latitude, longitude, 38.897957, -77.036560) as in_range 
                             from test_2) as a
                             where a.in_range = false;

it returns the error:

[Amazon](500310) Invalid operation: TypeError: a float is required. Please look at svl_udf_log for more information Details: ----------------------------------------------- error: TypeError: a float is required. Please look at svl_udf_log for more information code: 10000 context: UDF query: 1766 location: udf_client.cpp:369 process: query1_995_1766 [pid=50711] -----------------------------------------------;

This error seems to point to it being an issue with the UDF and the UDF inputs, but as shown above the UDF is working correctly on its own. I would think using count() on the results would simply be a sql query counting the items returned as false. Why is this an error when trying to count the results of the UDF?

UPDATE/EDIT: I am starting to believe that this some type of precision error happening in python 2.7 (the version Redshift docs state that its using). Here is the UDF that I"m running (credit to https://skipperkongen.dk/category/spatial/ for code; i just made additions):

CREATE OR REPLACE FUNCTION dist_in_range (radius float,lat1 float, lon1 float, lat2 float, lon2 float)
RETURNS bool IMMUTABLE AS
$$

    from math import radians, sin, cos, asin, sqrt, pi, atan2
    import numpy as np

    earth_radius_miles = 3956.0

    def dist_in_range(radius, lat1, lon1, lat2, lon2):
        """checks if a point is within int number of miles of second set of points.
        """
        lat1, lon1 = radians(lat1), radians(lon1)
        lat2, lon2 = radians(lat2), radians(lon2)
        dlat, dlon = float(lat2 - lat1), float(lon2 - lon1)
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        great_circle_distance = 2 * asin(min(1,sqrt(a)))
        if float(earth_radius_miles * great_circle_distance) < float(radius):
            return True
        else:
            return False

    return dist_in_range(radius, lat1, lon1, lat2, lon2)
$$ LANGUAGE plpythonu;

On the dataset I'm testing on, if I run this query:

SELECT dist_in_range(40, latitude, longitude, 20.652975, -87.102572) as in_range from test_2
where in_range = true;

It returns results with no error. If I lower the radius variable below 40 I start to get the "float is required" error, UNLESS I set the WHERE in_range = false, then it again returns results without error.

I was checking running smaller radii in a jupyter notebook and in some cases, while printing the calc steps out, I get really small numbers like 1.0134428420666964e-13 So, I'm wondering it this is a problem with precision in python 2.7 and if there is anything I can do to adjust for that?

Lastly, the log that the aws error references doesn't provide much more info as it just parrots the "TypeError: a float is required" message, and points to a line 11 and line 21 in the UDF but the line 11 is a comment and line 21 is the else: return False line.

Upvotes: 0

Answers (2)

Joe Harris

Reputation: 14045

Redshift now supports the GEOMETRY data type for spatial data and has 40+ high performance native functions.

Upvotes: 1

Zack

Reputation: 179

I had originally created and loaded the table in Redshift with the lat/lon data types specified as NUMERIC with (9,6) precision (I had seen this recommended for working with lat/lon types). I reloaded the table but changed the data types to FLOAT8 instead and now it works fine.

I had assumed incorrectly that a number with 6 digits after the decimal point would be viewed as a float but it was not.

Upvotes: 0

Redshift Python UDF runs on its own but throws an error when part using count or as part of another query

Answers (2)

Related Questions