Reputation: 878
I am using PySpark on Spark 1.5 on Cloudera YARN, with Python 3.3 on CentOS 6 machines. The SQL Server instance is SQL Server Enterprise 64-bit. The SQL Server driver is sqljdbc4.jar, and I have added the following to my .bashrc:
export SPARK_CLASSPATH="/var/lib/spark/sqljdbc4.jar"
export PYSPARK_SUBMIT_ARGS="--conf spark.executor.extraClassPath="/var/lib/spark/sqljdbc4.jar" --driver-class-path="/var/lib/spark/sqljdbc4.jar" --jars="/var/lib/spark/sqljdbc4.jar" --master yarn --deploy-mode client"
And I can see confirmation when I launch Spark that
SPARK_CLASSPATH was detected (set to '/var/lib/spark/sqljdbc4.jar')
I have a dataframe with this schema:
root
|-- daytetime: timestamp (nullable = true)
|-- ip: string (nullable = true)
|-- tech: string (nullable = true)
|-- th: string (nullable = true)
|-- car: string (nullable = true)
|-- min_dayte: timestamp (nullable = true)
|-- max_dayte: timestamp (nullable = true)
I have already created an empty table in my MS SQL Server called 'dbo.shaping', where the 3 timestamp columns are datetime2(7) and the others nvarchar(50).
I try to export the dataframe from PySpark like this:
properties = {"user": "<username>", "password": "<password>"}
df.write.format('jdbc').options(
    url='<IP>:1433/<dbname>',
    dbtable='dbo.shaping',
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    properties=properties)
I get the following traceback:
Py4JError: An error occurred while calling o250.option. Trace:
py4j.Py4JException: Method option([class java.lang.String, class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:744)
Are my methods at least correct, and is this error perhaps related to writing this specific type of data, i.e., do I have an issue with the data construct rather than my code?
Upvotes: 12
Views: 1912
Reputation: 330353
You cannot use a dict as a value for options. The options method expects only str arguments (see the Scala docs and PySpark annotations) and is expanded into separate calls to the Java option.
In current Spark versions the value is automatically converted to a string, so your code would fail silently, but that isn't the case in 1.5.
Since properties are specific to the JDBC driver anyway, you should use the jdbc method:
properties = {
    "user": "<username>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

df.write.jdbc(
    # note: SQL Server JDBC URLs use the jdbc:sqlserver:// scheme
    url='jdbc:sqlserver://<IP>:1433;databaseName=<dbname>',
    table='dbo.shaping',
    properties=properties)
though unpacking properties should work as well:

df.write.format('jdbc').options(
    url='jdbc:sqlserver://<IP>:1433;databaseName=<dbname>',
    dbtable='dbo.shaping',
    **properties).save()  # properties above already includes the driver
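To see why unpacking works where properties=properties does not, here is a rough pure-Python sketch (the options stand-in below is hypothetical, not Spark's implementation): unpacking turns the dict into separate string-valued keyword arguments, each of which can map to one Java option(String, String) call.

```python
# Hypothetical stand-in for DataFrameWriter.options, for illustration only:
# each keyword argument becomes one option(key, value) call with str values,
# which Py4J can match against the Java option(String, String) signature.
def options(**kwargs):
    calls = []
    for key, value in kwargs.items():
        calls.append(("option", key, value))  # one Java call per entry
    return calls

properties = {"user": "<username>", "password": "<password>"}

# properties=properties would send a single dict (a java.util.HashMap);
# **properties sends each entry as its own str/str pair instead.
calls = options(dbtable="dbo.shaping", **properties)
print(calls)
# → [('option', 'dbtable', 'dbo.shaping'),
#    ('option', 'user', '<username>'),
#    ('option', 'password', '<password>')]
```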
In general, when you see:
py4j.Py4JException: Method ... does not exist
it usually signals a mismatch between the local Python types and the types expected by the JVM method being called.
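A loose sketch of that matching step (a simplified analogy, not Py4J's actual code): the JVM side exposes only option(String, String), and a Python dict is marshalled as java.util.HashMap, which matches no registered signature.

```python
# Simplified analogy of JVM-side overload matching; Py4J's real logic differs.
JVM_SIGNATURES = {"option": [("String", "String")]}  # only option(String, String)

# How common Python types are marshalled to Java classes (small subset).
PY_TO_JAVA = {str: "String", dict: "java.util.HashMap"}

def invoke(name, *args):
    arg_types = tuple(PY_TO_JAVA[type(a)] for a in args)
    if arg_types not in JVM_SIGNATURES.get(name, []):
        raise TypeError(f"Method {name}({list(arg_types)}) does not exist")
    return ("ok", name, args)

invoke("option", "dbtable", "dbo.shaping")  # str/str: matches the signature
try:
    invoke("option", "properties", {"user": "<username>"})  # dict: no match
except TypeError as e:
    print(e)
# → Method option(['String', 'java.util.HashMap']) does not exist
```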
See also: How to use JDBC source to write and read data in (Py)Spark?
Upvotes: 6