Reputation: 898
I'm trying to add a new column, timezone, to my PySpark DataFrame:
from pyspark.sql.functions import col
from timezonefinder import TimezoneFinder
tf = TimezoneFinder()
df = df.withColumn("longitude",col("longitude").cast("float"))
df = df.withColumn("Latitude",col("Latitude").cast("float"))
df = df.withColumn("timezone",tf.timezone_at(lng=col("longitude"), lat=col("Latitude")))
I'm getting the error below:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
The timezonefinder library looks up a timezone from geocoordinates, for example:
Latitude, longitude = 20.5061, 50.358
tf.timezone_at(lng=longitude, lat=Latitude)
# 'Asia/Riyadh'
Upvotes: 1
Views: 167
Reputation: 42352
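tf.timezone_at is a plain Python function that expects numbers, while col(...) is a Column expression, so you need to wrap the call in a UDF that Spark applies row by row: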
import pyspark.sql.functions as F
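@F.udf('string')
def tfUDF(lng, lat):
    # Imported and constructed inside the UDF, so no TimezoneFinder object
    # has to be serialized and shipped to the executors.
    from timezonefinder import TimezoneFinder
    tf = TimezoneFinder()
    return tf.timezone_at(lng=lng, lat=lat)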
df = df.withColumn("longitude", F.col("longitude").cast("float"))
df = df.withColumn("Latitude", F.col("Latitude").cast("float"))
df = df.withColumn("timezone", tfUDF(F.col("longitude"), F.col("Latitude")))
df.show()
+--------+---------+-----------+
|Latitude|longitude|   timezone|
+--------+---------+-----------+
| 20.5061|   50.358|Asia/Riyadh|
+--------+---------+-----------+
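For what it's worth, the reason the original call fails can be reproduced without Spark. The sketch below is only an illustration: LazyColumn and lookup are hypothetical stand-ins (not PySpark or timezonefinder code), and I'm assuming timezone_at does some ordinary Python check on its arguments, without having traced exactly where. A Column is an expression rather than a value, so any `if`/`or` on it ends up calling bool() on the column, which raises the ValueError from the question. Calling the function once per row with real numbers, which is effectively what a UDF does, avoids that.

```python
class LazyColumn:
    """Hypothetical stand-in for a Spark Column: an expression, not a value."""

    def __init__(self, expr):
        self.expr = expr

    def __gt__(self, other):
        # Comparisons build a new expression instead of returning True/False.
        return LazyColumn(f"({self.expr} > {other})")

    def __lt__(self, other):
        return LazyColumn(f"({self.expr} < {other})")

    def __bool__(self):
        # PySpark's Column also refuses to be truth-tested.
        raise ValueError("Cannot convert column into bool")


def lookup(lng, lat):
    """Hypothetical plain Python function standing in for tf.timezone_at."""
    if lng > 180 or lng < -180:  # 'or' truth-tests the comparison result
        raise ValueError("longitude out of range")
    return "east" if lng > 0 else "west"


def call_with_column():
    """Pass an expression where a number is expected, as in the question."""
    try:
        return lookup(LazyColumn("longitude"), LazyColumn("Latitude"))
    except ValueError as exc:
        return f"ValueError: {exc}"


def call_per_row(rows):
    """Roughly what a UDF does: one call per row, with real values."""
    return [lookup(lng, lat) for lng, lat in rows]


print(call_with_column())
print(call_per_row([(50.358, 20.5061), (-3.7, 40.4)]))
```

The first call fails as soon as the function tries to branch on the column, while the per-row version only ever sees plain floats.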
Upvotes: 2