Reputation: 662
With pyspark running on a remote server, I am able to connect to an Oracle database on another server over JDBC, but any valid query I run returns an ORA-00903: invalid table name error.
I am able to connect to the database from my local machine with cx_Oracle
or pyodbc
, and when I connect from local, the queries that return the above error run without problems.
I've varied the queries I run, both locally and remotely, but no matter what type of valid query I submit through pyspark, the same error occurs. My connection setup is:
import os
from pyspark.sql import SparkSession

ORACLE_JAR = "ojdbc7.jar"
JAR_LOC = os.path.join(os.environ["JARS_DIR"], ORACLE_JAR)
spark = SparkSession.builder \
.appName("GetData") \
.config("spark.jars", "local://" + JAR_LOC) \
.getOrCreate()
exadata_instances = ["xxx.yyy.zzz", "aaa.bbb.cc"]
db_host = "xxx.yyy.zzz"
user = 'username'
password = 'passW0rd'
driver = "oracle.jdbc.OracleDriver"
sid = "eee.fff.ggg"
address_string = ""
for exadata_instance in exadata_instances:
address_string += f"(ADDRESS=(PROTOCOL=TCP)(HOST={exadata_instance})(PORT=1521))"
tns = f"(DESCRIPTION= \
(ADDRESS_LIST= \
(LOAD_BALANCE=OFF) \
(FAILOVER=ON) \
{address_string}) \
(CONNECT_DATA=(SERVICE_NAME={sid})(SERVER=DEDICATED)))"
url = f"jdbc:oracle:thin:@{tns}"
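As a side note on the snippet above: the backslash continuations inside the f-string embed the leading whitespace of each continued line into the TNS descriptor. Oracle's parser generally tolerates that whitespace, but the assembly can be reproduced without it by joining explicit pieces. A minimal, self-contained sketch (with placeholder hostnames and service name, not real endpoints):

```python
# Sketch of the connect-string assembly above; hostnames are placeholders.
exadata_instances = ["host1.example.com", "host2.example.com"]
sid = "my_service.example.com"

# One ADDRESS entry per Exadata host, concatenated into ADDRESS_LIST.
address_string = "".join(
    f"(ADDRESS=(PROTOCOL=TCP)(HOST={h})(PORT=1521))" for h in exadata_instances
)

# Adjacent string literals concatenate, so no stray whitespace is embedded.
tns = (
    "(DESCRIPTION="
    "(ADDRESS_LIST="
    "(LOAD_BALANCE=OFF)"
    "(FAILOVER=ON)"
    f"{address_string})"
    f"(CONNECT_DATA=(SERVICE_NAME={sid})(SERVER=DEDICATED)))"
)
url = f"jdbc:oracle:thin:@{tns}"
print(url)
```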
The below are variations of some of the queries I've tried. I think I've exhausted the combinations of upper- and lower-case table and view names, terminated with and without a trailing ;
.
dbtable = 'SELECT owner, table_name FROM all_tables'
dbtable = 'SELECT owner, table_name FROM all_tables;'
dbtable = 'SELECT owner, table_name FROM ALL_TABLES'
dbtable = 'SELECT owner, table_name FROM ALL_TABLES;'
dbtable = 'SELECT col1, col2 FROM V_MY_VIEW'
dbtable = 'SELECT col1, col2 FROM V_MY_VIEW;'
dbtable = 'SELECT COL1, COL2 FROM v_my_view'
Finally, with the above settings, I run the following pyspark command:
jdbc_df = spark.read.format("jdbc").option("url", url) \
.option("dbtable", dbtable) \
.option("driver", driver) \
.option("user", user) \
.option("inferSchema", True) \
.option("password", password).load()
This gives rise to the error (in full):
Py4JJavaError Traceback (most recent call last)
in engine
----> 1 jdbc_df = spark.read.format("jdbc").option("url", url) .option("dbtable", dbtable) .option("driver", driver) .option("user", user) .option("inferSchema", True) .option("password", password).load()
/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
163 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
164 else:
--> 165 return self._df(self._jreader.load())
166
167 @since(1.4)
/conda/miniconda3/envs/python3.6.8/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/conda/miniconda3/envs/python3.6.8/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o1253.load.
: java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1059)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:522)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:257)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:587)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:225)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:53)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:774)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:925)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1111)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4798)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4845)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1501)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
My gut feeling is that this isn't something in my code, but some setting either on the server or in the driver that I don't know how to access or control.
I'd be really grateful if someone could either tell me how I might go about debugging the issue or directly fix it. Thank you.
Upvotes: 2
Views: 2508
Reputation: 191275
From the documentation for dbtable
:
The JDBC table that should be read from or written into. Note that when using it in the read path anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
So in your examples you could do:
dbtable = '(SELECT owner, table_name FROM ALL_TABLES)'
optionally with an alias:
dbtable = '(SELECT owner, table_name FROM ALL_TABLES) t'
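Since the only change needed is wrapping the bare SELECT in parentheses (and dropping any trailing semicolon, which JDBC rejects), this can be captured in a small helper. The function name and alias below are illustrative, not part of any API:

```python
def as_dbtable(select_stmt: str, alias: str = "t") -> str:
    """Wrap a bare SELECT so it is valid in a FROM clause, as the
    dbtable option requires; strips any trailing semicolon first."""
    return f"({select_stmt.rstrip().rstrip(';')}) {alias}"

dbtable = as_dbtable("SELECT owner, table_name FROM ALL_TABLES")
print(dbtable)  # (SELECT owner, table_name FROM ALL_TABLES) t
```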
As an alternative you could use query
instead of (not as well as) dbtable
:
A query that will be used to read data into Spark. The specified query will be parenthesized and used as a subquery in the FROM clause. Spark will also assign an alias to the subquery clause.
... so effectively the same thing, but might make your code more understandable (entirely subjective, of course), i.e. something like:
query = 'SELECT owner, table_name FROM ALL_TABLES'
and then:
jdbc_df = spark.read.format("jdbc").option("url", url) \
.option("query", query) \
.option("driver", driver) \
...
Upvotes: 5