PySpark UDF with multiple arguments

Question

I am using a Python function to calculate the distance between two points, given their longitudes and latitudes.

import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    # Convert degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    newlon = lon2 - lon1
    newlat = lat2 - lat1

    haver_formula = np.sin(newlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(newlon/2.0)**2

    # Central angle in radians, scaled by the Earth's radius (about 3958 miles)
    dist = 2 * np.arcsin(np.sqrt(haver_formula))
    miles = 3958 * dist
    return miles

My DataFrame has four columns: lat, long, merch_lat, and merch_long.

When I create a UDF like this, it throws an error, and I don't know where I am going wrong.

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
udf_haversine = udf(haversine, FloatType())
data = data.withColumn("distance", udf_haversine("long", "lat", "merch_long", "merch_lat"))

The error is:

An error occurred while calling o1499.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure:

How do I create a UDF that takes multiple columns and returns a single value?
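
A hedged sketch of one likely cause, not a confirmed diagnosis, since the error above is truncated: passing several columns to a UDF is already fine (each column arrives as one positional argument per row), but the function in the question returns a numpy.float64, which in many PySpark versions cannot be converted to the declared return type and surfaces as a stage failure. The sketch below swaps numpy for the standard math module so the result is a plain Python float. The None handling, the clamp before asin, the helper name add_distance, and the choice of DoubleType are assumptions for illustration and are not part of the original post.

```python
import math

EARTH_RADIUS_MILES = 3958  # same constant as in the question


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles, returned as a plain Python float."""
    # Spark hands SQL NULLs to a Python UDF as None (assumed handling: propagate).
    if None in (lon1, lat1, lon2, lat2):
        return None
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    newlon = lon2 - lon1
    newlat = lat2 - lat1
    haver_formula = (math.sin(newlat / 2.0) ** 2
                     + math.cos(lat1) * math.cos(lat2) * math.sin(newlon / 2.0) ** 2)
    # Clamp to 1.0 so rounding error cannot push asin outside its domain.
    dist = 2 * math.asin(min(1.0, math.sqrt(haver_formula)))
    return EARTH_RADIUS_MILES * dist


def add_distance(data):
    """Hypothetical helper: add a 'distance' column to the question's DataFrame."""
    # Imported here so haversine itself can be tried without a Spark install.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DoubleType

    udf_haversine = udf(haversine, DoubleType())
    return data.withColumn(
        "distance",
        udf_haversine(col("long"), col("lat"), col("merch_long"), col("merch_lat")),
    )
```

With this in place, data = add_distance(data) would stand in for the withColumn call in the question. If the coordinate columns are stored as strings rather than numbers, they would need to be cast first.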
