Reputation: 109
Using PySpark SQL and given 3 columns, I would like to create an additional column that divides two of the columns, the third one being an ID column.
df = sqlCtx.createDataFrame(
    [
        (1, 4, 2),
        (2, 5, 2),
        (3, 10, 4),
        (4, 50, 10)
    ],
    ('ID', 'X', 'Y')
)
This is the desired output:
+----+----+----+---------------------+
| ID | X  | Y  | Z (expected result) |
+----+----+----+---------------------+
| 1  | 4  | 2  | 2                   |
| 2  | 5  | 2  | 2.5                 |
| 3  | 10 | 4  | 2.5                 |
| 4  | 50 | 10 | 5                   |
+----+----+----+---------------------+
To do so, I have created a UDF:
def createDivision(args):
    X = float(args[0])
    Y = float(args[1])
    RESULT = X / Y
    return RESULT
udf_createDivision = udf(createDivision, FloatType())
udf_createDivision_calc = udf_createDivision(df['X'], df['Y'])
df = df.withColumn("Z", udf_createDivision_calc)
df.show()
Then I get a long error in the output:
Py4JJavaError: An error occurred while calling o7401.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 756.0 failed 1 times, most recent failure: Lost task 0.0 in stage 756.0 (TID 7249, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
process()
File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 243, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>.......
I would very much appreciate some help, because I don't know how to interpret the error. Thanks.
Upvotes: 3
Views: 19417
Reputation: 116
Just use a column expression:
from pyspark.sql.functions import col
df.withColumn("Z", col("X") / col("Y"))
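For reference, a minimal end-to-end sketch of the expression approach on the sample data (using sqlCtx as in the question; column references resolve case-insensitively by default, so col("x") would also work):
from pyspark.sql.functions import col

df = sqlCtx.createDataFrame(
    [(1, 4, 2), (2, 5, 2), (3, 10, 4), (4, 50, 10)],
    ('ID', 'X', 'Y')
)

# Plain column arithmetic runs inside the JVM, with no per-row Python call.
df.withColumn("Z", col("X") / col("Y")).show()
# Z comes back as a double: 2.0, 2.5, 2.5, 5.0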
As for your code (you really shouldn't use a udf here), the function should be either:
def createDivision(x, y):
    return x / y
or
def createDivision(*args):
    return args[0] / args[1]
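For completeness, a sketch of the udf route with the signature fixed: the original failure comes from calling the udf with two columns while createDivision accepted only a single positional argument. The imports for udf and FloatType are assumed to mirror the question's setup:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def createDivision(x, y):
    # One parameter per column passed to the udf call below.
    return x / y

udf_createDivision = udf(createDivision, FloatType())
df = df.withColumn("Z", udf_createDivision(df['X'], df['Y']))
df.show()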
Upvotes: 10