PR102012

Reputation: 878

PySpark 1.5 & MSSQL jdbc

I am using PySpark on Spark 1.5 on Cloudera YARN, with Python 3.3 on CentOS 6 machines. The SQL Server instance is SQL Server Enterprise 64-bit. The SQL Server JDBC driver is sqljdbc4.jar, and I have added the following to my .bashrc:

export SPARK_CLASSPATH="/var/lib/spark/sqljdbc4.jar"
export PYSPARK_SUBMIT_ARGS="--conf spark.executor.extraClassPath="/var/lib/spark/sqljdbc4.jar" --driver-class-path="/var/lib/spark/sqljdbc4.jar" --jars="/var/lib/spark/sqljdbc4.jar" --master yarn --deploy-mode client"

And when I launch Spark, I can see confirmation that

SPARK_CLASSPATH was detected (set to '/var/lib/spark/sqljdbc4.jar')

I have a dataframe with this schema:

root
 |-- daytetime: timestamp (nullable = true)
 |-- ip: string (nullable = true)
 |-- tech: string (nullable = true)
 |-- th: string (nullable = true)
 |-- car: string (nullable = true)
 |-- min_dayte: timestamp (nullable = true)
 |-- max_dayte: timestamp (nullable = true)

I have already created an empty table in my MS SQL Server called 'dbo.shaping', where the three timestamp columns will be datetime2(7) and the others nvarchar(50).

I try to export the dataframe from PySpark with this:

properties = {"user": "<username>", "password": "<password>"} 

df.write.format('jdbc').options(url='<IP>:1433/<dbname>', dbtable='dbo.shaping',driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",properties=properties)

I get the following traceback:

Py4JError: An error occurred while calling o250.option. Trace:
py4j.Py4JException: Method option([class java.lang.String, class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:744)

Is my approach at least correct? Could this error be related to writing a specific type of data, i.e., an issue with the data itself rather than my code?

Upvotes: 12

Views: 1912

Answers (1)

zero323

Reputation: 330353

You cannot use a dict as a value for options. The options method expects only str arguments (see the Scala docs and the PySpark annotations) and is expanded into separate calls to the Java option method.

In current Spark versions the value is automatically converted to a string, so your code would fail silently there, but that isn't the case in 1.5.
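To see why the call blows up, here is a minimal plain-Python sketch of the expansion. MockWriter is a hypothetical stand-in for DataFrameWriter, not Spark's actual class; it only mimics the relevant behavior, namely that options(**kwargs) is just repeated option(key, value) calls, and that the Py4J bridge resolves option against a Java option(String, String) signature:

```python
# MockWriter is a hypothetical stand-in for DataFrameWriter (illustration
# only, not Spark's real class).
class MockWriter:
    def __init__(self):
        self.opts = {}

    def option(self, key, value):
        # Py4J looks up a Java method option(String, String); a dict value
        # has no matching Java signature, which surfaces as
        # "Method option([...]) does not exist" in Spark 1.5.
        if not isinstance(value, str):
            raise TypeError(
                "Method option([class java.lang.String, "
                "class java.util.HashMap]) does not exist")
        self.opts[key] = value
        return self

    def options(self, **kwargs):
        # options(...) simply expands into one option(key, value) call
        # per keyword argument.
        for key, value in kwargs.items():
            self.option(key, value)
        return self


writer = MockWriter()
# String-valued options pass through fine:
writer.options(dbtable='dbo.shaping')

# Passing the dict through options reproduces the failure mode:
try:
    writer.options(properties={"user": "<username>"})
except TypeError as exc:
    print(exc)
```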

Since properties are specific to the JDBC driver anyway, you should use the jdbc method:

properties = {
    "user": "<username>", "password": "<password>", "driver": 
    "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

df.write.jdbc(
    # the Microsoft driver expects the jdbc:sqlserver://...;databaseName=... URL form
    url='jdbc:sqlserver://<IP>:1433;databaseName=<dbname>',
    table='dbo.shaping',
    properties=properties)

though unpacking properties should work as well:

.options(
    url='jdbc:sqlserver://<IP>:1433;databaseName=<dbname>',
    dbtable='dbo.shaping',
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    **properties)
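Unpacking works because ** spreads each dict entry into its own string-valued keyword argument, so every entry reaches a separate option call. A quick plain-Python illustration (the options function below is a hypothetical stand-in that just reports what keyword arguments arrive):

```python
properties = {
    "user": "<username>", "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}


def options(**kwargs):
    # hypothetical stand-in: return the keyword arguments as received
    return kwargs


received = options(
    url='jdbc:sqlserver://<IP>:1433;databaseName=<dbname>',
    dbtable='dbo.shaping',
    **properties)

print(sorted(received))  # ['dbtable', 'driver', 'password', 'url', 'user']
# Every value is a plain str, so each expands cleanly into option(key, value).
assert all(isinstance(v, str) for v in received.values())
```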

In general, when you see:

py4j.Py4JException: Method ... does not exist

it usually signals a mismatch between the local Python types and the types expected by the JVM method being called.

See also: How to use JDBC source to write and read data in (Py)Spark?

Upvotes: 6
