Leo

Reputation: 898

Adding a new column in a PySpark dataframe

I'm trying to add a new Timezone column to my PySpark dataframe:

from timezonefinder import TimezoneFinder
from pyspark.sql.functions import col

tf = TimezoneFinder()
df = df.withColumn("longitude", col("longitude").cast("float"))
df = df.withColumn("Latitude", col("Latitude").cast("float"))
df = df.withColumn("timezone", tf.timezone_at(lng=col("longitude"), lat=col("Latitude")))

I'm getting the error below.

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

The timezonefinder library is used to find the timezone for a given pair of geocoordinates:

Latitude, longitude = 20.5061, 50.358
tf.timezone_at(lng=longitude, lat=Latitude)
# 'Asia/Riyadh'

Upvotes: 1

Views: 167

Answers (1)

mck

Reputation: 42352

tf.timezone_at expects plain Python floats, but col("longitude") and col("Latitude") are Column expressions; when the library tries to check those values, Spark raises the ValueError above. You need to use a UDF so the Python function receives the actual values from each row:

import pyspark.sql.functions as F

@F.udf('string')
def tfUDF(lng, lat):
    # create the TimezoneFinder inside the UDF so it does not have to be
    # pickled and shipped from the driver to the executors
    from timezonefinder import TimezoneFinder
    tf = TimezoneFinder()
    return tf.timezone_at(lng=lng, lat=lat)

df = df.withColumn("longitude", F.col("longitude").cast("float"))
df = df.withColumn("Latitude", F.col("Latitude").cast("float"))
df = df.withColumn("timezone", tfUDF(F.col("longitude"), F.col("Latitude")))

df.show()
+--------+---------+-----------+
|Latitude|longitude|   timezone|
+--------+---------+-----------+
| 20.5061|   50.358|Asia/Riyadh|
+--------+---------+-----------+
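
If the dataframe is large, a scalar pandas UDF should be faster because it processes values in batches rather than row by row. A minimal sketch, assuming Spark 3.x with pyarrow installed (tzPandasUDF is just an illustrative name):

import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf('string')
def tzPandasUDF(lng: pd.Series, lat: pd.Series) -> pd.Series:
    # one TimezoneFinder instance per batch instead of per row
    from timezonefinder import TimezoneFinder
    tf = TimezoneFinder()
    return pd.Series([tf.timezone_at(lng=x, lat=y) for x, y in zip(lng, lat)])

df = df.withColumn("timezone", tzPandasUDF(F.col("longitude"), F.col("Latitude")))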

Upvotes: 2
