Reputation: 1981
I am creating a Spark job that requires a column to be added to a DataFrame using a function written in Python. The rest of the processing is done in Scala.
I have found examples of how to call a Java/Scala function from PySpark, but the only examples I have found of sending data the other way use pipe.
Is it possible for me to send the entire DataFrame to a Python function, have the function manipulate the data and add additional columns, and then send the resulting DataFrame back to the calling Scala function?
If this isn't possible, my current solution is to run a PySpark process and call multiple Scala functions to manipulate the DataFrame, but this isn't ideal.
Upvotes: 5
Views: 8186
Reputation: 519
Just register a UDF from Python, and then from Scala evaluate a SQL statement that uses the function against a DataFrame. It works like a charm; I just tried it. https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook is a good way to run a notebook in Toree that mixes Scala and Python code calling the same Spark context.
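A minimal sketch of the Python side, assuming Spark 2.x and a shared SparkSession; the function name add_one and the column name some_col are just placeholders:

    # Python side: define and register a UDF against the shared Spark session
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()

    def add_one(x):
        # Example transformation; handle nulls defensively
        return None if x is None else x + 1

    # Make the function callable from Spark SQL under the name "add_one"
    spark.udf.register("add_one", add_one, IntegerType())

On the Scala side you would register the DataFrame as a temp view (df.createOrReplaceTempView("my_table")) and run spark.sql("SELECT *, add_one(some_col) AS some_col_plus_one FROM my_table") against the same session; the DataFrame that comes back includes the new column.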
Upvotes: 1
Reputation: 1032
I found this post:
Machine Learning with Jupyter using Scala, Spark and Python: The Setup
It shows you how to set up a Jupyter notebook that uses both Spark and Python. If you are just experimenting with data, that might be enough.
Upvotes: 0