Reputation: 2195
I want to call the main
method of the FAT jar from Pyspark.
Here is the entrypoint of the main method of the jar (Scala):
object Main {
def main(args: Array[String]): Unit = {
// codes
}
}
In order to invoke above method, I need to create an Array[String]
using pyspark's py4j:
str_array = sc._jvm.java.lang.reflect.Array.newInstance(sc._jvm.java.lang.String, 3)
str_array[0] = "228"
loaded_class = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("com.mycompany.Main")
loaded_class.main(str_array)
And this is the error I get:
Py4JError: java.lang.String._get_object_id does not exist in the JVM
With plain Py4j, I could have created string array using:
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
gateway.new_array(gateway.jvm.java.lang.String, 4)
I tried to pass object array to main, but that didn't work:
ob = sc._jvm.java.lang.Object()
ob_array = sc._jvm.java.lang.reflect.Array.newInstance(ob.getClass(), 3)
ob_array[0] = "228"
loaded_class = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("com.mycompany.Main")
loaded_class.main(ob_array)
fails with error:
Py4JError: An error occurred while calling o516.main. Trace:
py4j.Py4JException: Method main([class [Ljava.lang.Object;]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
How do I create the string array to invoke main method in PySpark ?
Upvotes: 1
Views: 1542
Reputation: 1705
To convert some python list of strings (arr
) to java string array in pyspark you can use the function toJStringArray
bellow:
def toJStringArray(arr):
jarr = sc._gateway.new_array(sc._jvm.java.lang.String, len(arr))
for i in range(len(arr)):
jarr[i] = arr[i]
return jarr
# Usage example:
java_arr = toJStringArray(['string1'])
I run this code in the azure databricks. So the sc
is a predefined global for the spark context.
Upvotes: 2
Reputation: 11
Similar to the plain Py4j, can you try creating it using :
str_array = sc._jvm._gateway.new_array(sc._jvm.java.lang.String, 4)
Upvotes: 1