blue-sky

Reputation: 53806

Using PySpark sql functions

This function call:

from pyspark.sql import functions as F
lg = F.log(5.2)  # fails with the Py4JError shown below

(documented at http://spark.apache.org/docs/latest/api/python/pyspark.sql.html)

raises:

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Double]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

The documentation shows the function being applied to a DataFrame column:

>>> df.select(log(10.0, df.age).alias('ten')).rdd.map(lambda l: str(l.ten)[:7]).collect()
['0.30102', '0.69897']
>>> df.select(log(df.age).alias('e')).rdd.map(lambda l: str(l.e)[:7]).collect()
['0.69314', '1.60943']

Shouldn't it also be possible to use the log function on a standalone value?
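
For comparison, here is a minimal sketch that matches the documented example (the ages 2 and 5 are inferred from the outputs quoted above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# ages 2 and 5 reproduce the natural-log results quoted from the docs
df = spark.createDataFrame([(2,), (5,)], ["age"])

# passing a column works; only the bare literal F.log(5.2) fails
df.select(F.log(df.age).alias("e")).show()   # ~0.6931 and ~1.6094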

Upvotes: 1

Views: 7936

Answers (1)

Rags

Reputation: 1881

The functions in pyspark.sql.functions are meant to be applied to DataFrame columns: they expect a column (or a column name) as their argument. When you pass the bare value 5.2, PySpark tries to resolve it as a column, and the JVM-side lookup fails because there is no col method that accepts a Double, which is exactly what the Py4JException in the trace says.

To apply a logarithm to a plain Python value, use math.log instead.
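
A minimal sketch using the 5.2 from the question:

import math

math.log(5.2)       # natural log of a plain Python float, ~1.6487
math.log(5.2, 10)   # optional second argument gives the base (here base 10)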

Upvotes: 3
