PySpark jdbc predicates error: Py4JError: An error occurred while calling o108.jdbc

Question

I'm trying to use predicates in my DataFrameReader.jdbc() method:

df = sqlContext.read.jdbc(
    url="jdbc:db2://bluemix05.bluforcloud.com:50001/BLUDB:user=****;password=****;sslConnection=true;",  
    table="GOSALES.BRANCH",
    predicates=['WHERE BRANCH_CODE=5']
).cache()

However, I'm hitting the following error:

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
...

Py4JError: An error occurred while calling o108.jdbc. Trace:
py4j.Py4JException: Method jdbc([class java.lang.String, class java.lang.String, class [Ljava.lang.Object;, class java.util.Properties]) does not exist

How should I be adding predicates to the jdbc method call?

zero323 · Accepted Answer

There are at least two problems here. One looks like a PySpark bug and, as far as I can tell, is already fixed in the current master.

The other problem is the condition you use. Each predicate is an expression that goes inside a WHERE clause, so it should be simply 'BRANCH_CODE = 5', not 'WHERE BRANCH_CODE = 5'.
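
As a rough sketch of that fix, assuming a PySpark version where the bug above is already resolved: normalize_predicate and read_with_predicates are hypothetical helper names, not part of PySpark, and sql_context and url stand for the objects from the question.

```python
import re


def normalize_predicate(predicate):
    """Drop a leading WHERE keyword so only the bare condition remains."""
    return re.sub(r"^\s*WHERE\s+", "", predicate, flags=re.IGNORECASE).strip()


def read_with_predicates(sql_context, url, table, predicates):
    """Read a JDBC table with one partition per (normalized) predicate."""
    conditions = [normalize_predicate(p) for p in predicates]
    # DataFrameReader.jdbc expects bare conditions such as "BRANCH_CODE = 5".
    return sql_context.read.jdbc(url=url, table=table, predicates=conditions)
```

With the values from the question this would be called as read_with_predicates(sqlContext, url, "GOSALES.BRANCH", ["WHERE BRANCH_CODE=5"]), which passes 'BRANCH_CODE=5' to the reader.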

Finally, if you use only a single predicate, it makes more sense to pass it as a subquery like this:

df = sqlContext.read.jdbc(
    url=url,
    table="(SELECT * FROM GOSALES.BRANCH WHERE BRANCH_CODE=5) AS tmp")

A JDBC read with predicates creates one JDBC partition per predicate, so it is much harder to tune. You also have to make sure the predicates do not overlap; otherwise the same rows are read by more than one partition and show up as duplicates.
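
If you do want several predicates, one way to avoid duplicates is to generate half-open ranges that cannot overlap. This is only a sketch: range_predicates is a hypothetical helper, the boundaries 10 and 20 are made-up values, and rows where the column is NULL match none of the conditions.

```python
def range_predicates(column, boundaries):
    """Build non-overlapping range conditions covering all non-NULL values."""
    bounds = sorted(boundaries)
    if not bounds:
        raise ValueError("at least one boundary is required")
    # Everything below the first boundary.
    predicates = ["{0} < {1}".format(column, bounds[0])]
    # Half-open ranges [low, high) between consecutive boundaries.
    for low, high in zip(bounds, bounds[1:]):
        predicates.append("{0} >= {1} AND {0} < {2}".format(column, low, high))
    # Everything from the last boundary upwards.
    predicates.append("{0} >= {1}".format(column, bounds[-1]))
    return predicates


predicates = range_predicates("BRANCH_CODE", [10, 20])
# Each entry becomes one JDBC partition, e.g.:
# df = sqlContext.read.jdbc(url=url, table="GOSALES.BRANCH", predicates=predicates)
```

Here the three generated conditions give three partitions, and every non-NULL BRANCH_CODE falls into exactly one of them.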
