rabejens

Reputation: 8122

Is calling JVM-based algorithms / functions from PySpark possible?

I created a set of algorithms and helpers in Scala for Spark that work with different formats of measured data; they are all based on Hadoop's FileInputFormat. I also created some helpers to ease working with time series data from a Cassandra database. I now need some advanced functions that are already present in Thunder, and some of my colleagues who will work with these helpers want to use Python. Is it possible to call these helper functions from Python, or do I have to reimplement them?

I read through a lot of docs and only found that you can load extra jars with PySpark, but not how to use the functions they contain.
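Loading the jar itself is straightforward, e.g. via the spark.jars setting when creating the context (the jar path below is just a placeholder):

from pyspark import SparkConf, SparkContext

# Placeholder path to the jar containing the Scala helpers
conf = SparkConf().set("spark.jars", "/path/to/measurement-helpers.jar")
sc = SparkContext(conf=conf)

But how do I then actually call into the classes in that jar from Python?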

Upvotes: 1

Views: 832

Answers (1)

rabejens

Reputation: 8122

"By accident" I found the solution: It is the "Java Gateway". This is not documented in the Spark documentation (at least I didn't find it).

Here is how it works, using java.util.GregorianCalendar as an example:

j = sc._gateway.jvm                    # entry point into the JVM via Py4J
cal = j.java.util.GregorianCalendar()  # instantiate any JVM class by its fully qualified name
print(cal.getTimeInMillis())

However, passing the SparkContext does not work directly; the Python SparkContext object cannot be handed over to the JVM:

ref = j.java.util.concurrent.atomic.AtomicReference()
ref.set(sc)  # fails: sc is the Python wrapper, not a JVM object

The Java SparkContext is in the _jsc field, and passing that works:

ref = j.java.util.concurrent.atomic.AtomicReference()
ref.set(sc._jsc)  # works: sc._jsc is a JVM-side JavaSparkContext

Note, however, that sc._jsc is the Java wrapper, i.e., a JavaSparkContext. To get the original Scala SparkContext, you have to use:

sc._jsc.sc()
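Putting it together, calling one of your own Scala helpers from Python looks roughly like the sketch below. The package, object, and method names (com.example.helpers.TimeSeriesLoader.load) are hypothetical placeholders for whatever your jar actually provides:

j = sc._gateway.jvm
scala_sc = sc._jsc.sc()  # unwrap to the Scala SparkContext

# Hypothetical Scala object with a method load(sc: SparkContext, path: String)
result = j.com.example.helpers.TimeSeriesLoader.load(scala_sc, "/data/measurements")

The result is a Py4J proxy object; if the helper returns an RDD, you still have to wrap it yourself before using it from the Python side.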

Upvotes: 3
