experiment

Reputation: 315

TypeError: a float is required pyspark

I want to calculate the haversine distance between lat-long points. I am using the UCI GPS Trajectories dataset, https://archive.ics.uci.edu/ml/datasets/GPS+Trajectories (go_track_trackspoints.csv), and the code below to calculate the distance:

def dist(lon2, lat2, lon1, lat1):
    phi_1 = toRadians(lat1)
    phi_2 = toRadians(lat2)
    delta_phi = toRadians(lat2 - lat1)
    delta_lambda = toRadians(lon2 - lon1)

    a = sin(delta_phi / 2.0)**2 + cos(phi_1) * cos(phi_2) * sin(delta_lambda / 2.0)**2
    c = 2 * atan2(sqrt(abs(a)), sqrt(abs(1 - a)))
    return c * 6372.8

The schema is:

root
 |-- id: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- track_id: string (nullable = true)
 |-- time: string (nullable = true)

I have loaded the data into a Spark DataFrame (gps_data):

+---+-----------------+-----------------+--------+-------------------+
| id|         latitude|        longitude|track_id|               time|
+---+-----------------+-----------------+--------+-------------------+
|  1|-10.9393413858164|-37.0627421097422|       1|2014-09-13 07:24:32|
|  2| -10.939341385769|-37.0627421097809|       1|2014-09-13 07:24:37|
|  3|-10.9393239478718|-37.0627645137212|       1|2014-09-13 07:24:42|
|  4|-10.9392105616561|-37.0628430455445|       1|2014-09-13 07:24:47|
+---+-----------------+-----------------+--------+-------------------+

Using the command below, I am trying to add a distance column:

my_window = Window.partitionBy().orderBy("time")  
gps_d=gps_data.withColumn("dist", dist(
        "longitude", "latitude",
        lag("longitude", 1).over(my_window), lag("latitude", 1).over(my_window)
    ).alias("dist"))

But it fails with an error that I cannot resolve, and I have not been able to find a solution. Please help!

The error is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-219-6352041ff223> in <module>()
      1 gps_d=gps_data.withColumn("dist", dist(
      2     "longitude", "latitude",
----> 3     lag("longitude", 1).over(my_window), lag("latitude", 1).over(my_window)
      4 ).alias("dist"))

<ipython-input-218-2f781ab3b2fb> in dist(lon2, lat2, lon1, lat1)
      9 
     10         a=sin(delta_phi/2.0)**2+cos(phi_1)*cos(phi_2)*sin(delta_lambda/2.0)**2
---> 11         c=2*atan2(sqrt(a),sqrt(1-a))
     12         return c * 6372.8

TypeError: a float is required

PS: I have checked that none of the columns contain nulls after loading the CSV into the DataFrame.
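
A note on the message itself: "a float is required" is the kind of TypeError that Python's math module raises when it receives something that is not a number. The traceback shows the sin/cos line completing and the atan2/sqrt line failing, so one plausible cause (an assumption, the post does not show its imports) is that sqrt or atan2 came from math rather than from pyspark.sql.functions. The sketch below reproduces the behaviour without Spark; FakeColumn is a hypothetical stand-in for a Spark Column, not a real PySpark class.

```python
import math


class FakeColumn(object):
    """Hypothetical stand-in for a Spark Column: an expression, not a number."""

    def __init__(self, expr):
        self.expr = expr


def try_math_sqrt(value):
    """Return the name of the exception math.sqrt raises for value, or None."""
    try:
        math.sqrt(value)
    except TypeError as exc:
        return type(exc).__name__
    return None


print(try_math_sqrt(4.0))              # a real float is accepted
print(try_math_sqrt(FakeColumn("a")))  # an expression object is rejected
```

The exact wording of the TypeError differs between Python versions, so the sketch reports only the exception type. If this is the cause, importing sqrt and atan2 from pyspark.sql.functions instead of math should remove the error.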

Upvotes: 0

Views: 1488

Answers (1)

Ramesh Maharjan

Reputation: 41987

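I could not reproduce the error, but the dist column in my output was all null. This is probably because a plain string used in arithmetic such as lat2-lat1 is treated as a string literal rather than as a column name. Passing Column objects instead of strings fixed it: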

gps_data.withColumn("dist", dist(
    col("longitude"), col("latitude"),
    lag("longitude", 1).over(my_window), lag("latitude", 1).over(my_window)
).alias("dist")).show(truncate=False)

which gave this output:

+---+-----------------+-----------------+--------+-------------------+---------------------+
|id |latitude         |longitude        |track_id|time               |dist                 |
+---+-----------------+-----------------+--------+-------------------+---------------------+
|1  |-10.9393413858164|-37.0627421097422|1       |2014-09-13 07:24:32|null                 |
|2  |-10.939341385769 |-37.0627421097809|1       |2014-09-13 07:24:37|6.756720378438061E-9 |
|3  |-10.9393239478718|-37.0627645137212|1       |2014-09-13 07:24:42|0.0031221549946508337|
|4  |-10.9392105616561|-37.0628430455445|1       |2014-09-13 07:24:47|0.01525123103019258  |
+---+-----------------+-----------------+--------+-------------------+---------------------+
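
As a sanity check on the values above, the same formula can be evaluated in plain Python with the math module. This is only a sketch: haversine_km and EARTH_RADIUS_KM are names introduced here for illustration, and the coordinates are copied from the four rows shown in the question.

```python
import math

EARTH_RADIUS_KM = 6372.8  # same radius constant as in dist()


def haversine_km(lon2, lat2, lon1, lat1):
    """Haversine distance in km between two points given in degrees."""
    phi_1 = math.radians(lat1)
    phi_2 = math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2.0) ** 2
         + math.cos(phi_1) * math.cos(phi_2) * math.sin(delta_lambda / 2.0) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return c * EARTH_RADIUS_KM


# (latitude, longitude) pairs for ids 1 to 4, ordered by time
rows = [
    (-10.9393413858164, -37.0627421097422),
    (-10.939341385769, -37.0627421097809),
    (-10.9393239478718, -37.0627645137212),
    (-10.9392105616561, -37.0628430455445),
]

# Distance from each row to the previous one, like lag(..., 1) over the window
for (prev_lat, prev_lon), (lat, lon) in zip(rows, rows[1:]):
    print(haversine_km(lon, lat, prev_lon, prev_lat))
```

The printed distances should be close to the dist column for ids 2 to 4. The first row has no predecessor, which is why Spark reports null there.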

I hope this helps.

Upvotes: 1
