Reputation: 78
I'm trying to calculate the distance in kilometers between two geographical coordinates using the haversine formula in Spark 2.3 with Scala 2.11.8.
For each user, I want to compute the distance between two consecutive movements.
I have the longitude and latitude of each position, and the idea is to get the distance in km:
+-----+------------------+------------------+-----------------+
| user|          distance|Longitude_Centroid|Latitude_Centroid|
+-----+------------------+------------------+-----------------+
|-2525|              null| 7.038245640847997|39.48919886182785|
|-2147|12818.567585128396| 7.038245640847997|39.48919886182785|
|-2147|12818.567585128396| 7.038245640847997|39.48919886182785|
|-2525|12862.278795753988| 7.050538333095536|39.49362379246508|
+-----+------------------+------------------+-----------------+
It worked fine for me with a Python DataFrame, but I am struggling with Spark in Scala.
I used the following code, but it does not seem to work properly:
df4.withColumn("a",
    pow(sin((lag($"Latitude_Centroid", 1).over(window) - $"Latitude_Centroid") / 2), 2) +
      cos(($"Latitude_Centroid")) *
      cos((lag($"Latitude_Centroid", 1).over(window)) *
        pow(sin((lag($"Longitude_Centroid", 1).over(window) - $"Longitude_Centroid") / 2), 2)))
  .withColumn("distance", atan2(sqrt($"a"), sqrt(-$"a" + 1)) * 2 * 6371)
  .select("imei", "distance", "Longitude_Centroid", "Latitude_Centroid")
  .show(50)
Upvotes: 1
Views: 1927
Reputation: 78
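Just found the solution. The coordinates have to be converted to radians with toRadians before the trigonometric functions are applied; I also computed the lagged values as separate columns first: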
df4.withColumn("lat_lag", lag($"Latitude_Centroid", 1).over(window))
  .withColumn("lng_lag", lag($"Longitude_Centroid", 1).over(window))
  .select("imei", "lat_lag", "lng_lag", "date_from", "Longitude_Centroid", "Latitude_Centroid")
  .withColumn("a",
    pow(sin(toRadians($"Latitude_Centroid" - $"lat_lag") / 2), 2) +
      cos(toRadians($"lat_lag")) * cos(toRadians($"Latitude_Centroid")) *
        pow(sin(toRadians($"Longitude_Centroid" - $"lng_lag") / 2), 2))
  .withColumn("distance", atan2(sqrt($"a"), sqrt(-$"a" + 1)) * 2 * 6371)
  .select("imei", "lat_lag", "lng_lag", "date_from", "Longitude_Centroid", "Latitude_Centroid", "distance")
  .show()
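To sanity-check the values in the distance column, it can help to run the same haversine computation on a single pair of points outside Spark. The following is only a minimal sketch in plain Scala: the object name Haversine and the method distanceKm are made up for illustration and are not part of Spark or of the code above. It assumes inputs in decimal degrees and uses the same 6371 km Earth radius.

```scala
// Hypothetical helper for spot-checking individual rows; not part of Spark.
object Haversine {
  // Mean Earth radius in kilometres, the same constant used in the Spark expression.
  val EarthRadiusKm = 6371.0

  /** Great-circle distance in km between two points given in decimal degrees. */
  def distanceKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    // Convert the differences to radians before any trigonometric call.
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusKm * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  }
}
```

For example, Haversine.distanceKm(39.48919886182785, 7.038245640847997, 39.49362379246508, 7.050538333095536) uses the two distinct centroids from the table in the question and should come out at a little over one kilometre, which is the order of magnitude to expect from the Spark column for those rows.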
Upvotes: 3