Reputation: 662
With pyspark running on a remote server, I am able to connect to an Oracle database on another server over JDBC, but any valid query I run returns an ORA-00903: invalid table name error.
I am able to connect to the database from my local machine with cx_Oracle
or pyodbc
, and when I connect from local, the queries that return the above error run without problems.
I've varied the queries I run, both locally and remotely, but no matter what type of valid query I submit through pyspark, the same error occurs. My connection setup is:
import os
from pyspark.sql import SparkSession

ORACLE_JAR = "ojdbc7.jar"
JAR_LOC = os.path.join(os.environ["JARS_DIR"], ORACLE_JAR)
spark = SparkSession.builder \
.appName("GetData") \
.config("spark.jars", "local://" + JAR_LOC) \
.getOrCreate()
exadata_instances = ["xxx.yyy.zzz", "aaa.bbb.cc"]
db_host = "xxx.yyy.zzz"
user = 'username'
password = 'passW0rd'
driver = "oracle.jdbc.OracleDriver"
sid = "eee.fff.ggg"
address_string = ""
for exadata_instance in exadata_instances:
address_string += f"(ADDRESS=(PROTOCOL=TCP)(HOST={exadata_instance})(PORT=1521))"
tns = f"(DESCRIPTION= \
(ADDRESS_LIST= \
(LOAD_BALANCE=OFF) \
(FAILOVER=ON) \
{address_string}) \
(CONNECT_DATA=(SERVICE_NAME={sid})(SERVER=DEDICATED)))"
url = f"jdbc:oracle:thin:@{tns}"
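As a side note on the snippet above: the backslash continuations inside the f-string embed the leading whitespace of each continued line into the TNS descriptor. Oracle's parser generally tolerates that whitespace, but the assembly can be reproduced without it by joining explicit pieces. A minimal, self-contained sketch (with placeholder hostnames and service name, not real endpoints):

```python
# Sketch of the connect-string assembly above; hostnames are placeholders.
exadata_instances = ["host1.example.com", "host2.example.com"]
sid = "my_service.example.com"

# One ADDRESS entry per Exadata host, concatenated into ADDRESS_LIST.
address_string = "".join(
    f"(ADDRESS=(PROTOCOL=TCP)(HOST={h})(PORT=1521))" for h in exadata_instances
)

# Adjacent string literals concatenate, so no stray whitespace is embedded.
tns = (
    "(DESCRIPTION="
    "(ADDRESS_LIST="
    "(LOAD_BALANCE=OFF)"
    "(FAILOVER=ON)"
    f"{address_string})"
    f"(CONNECT_DATA=(SERVICE_NAME={sid})(SERVER=DEDICATED)))"
)
url = f"jdbc:oracle:thin:@{tns}"
print(url)
```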
The below are variations of some of the queries I've tried. I think I've exhausted the combinations of upper- and lower-case table and view names, terminated with and without a trailing ;
.
dbtable = 'SELECT owner, table_name FROM all_tables'
dbtable = 'SELECT owner, table_name FROM all_tables;'
dbtable = 'SELECT owner, table_name FROM ALL_TABLES'
dbtable = 'SELECT owner, table_name FROM ALL_TABLES;'
dbtable = 'SELECT col1, col2 FROM V_MY_VIEW'
dbtable = 'SELECT col1, col2 FROM V_MY_VIEW;'
dbtable = 'SELECT COL1, COL2 FROM v_my_view'
Finally, with the above settings, I run the following pyspark command:
jdbc_df = spark.read.format("jdbc").option("url", url) \
.option("dbtable", dbtable) \
.option("driver", driver) \
.option("user", user) \
.option("inferSchema", True) \
.option("password", password).load()
This gives rise to the error (in full):
Py4JJavaError Traceback (most recent call last)
in engine
----> 1 jdbc_df = spark.read.format("jdbc").option("url", url) .option("dbtable", dbtable) .option("driver", driver) .option("user", user) .option("inferSchema", True) .option("password", password).load()
/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
163 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
164 else:
--> 165 return self._df(self._jreader.load())
166
167 @since(1.4)
/conda/miniconda3/envs/python3.6.8/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/conda/miniconda3/envs/python3.6.8/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o1253.load.
: java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1059)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:522)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:257)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:587)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:225)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:53)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:774)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:925)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1111)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4798)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4845)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1501)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
My gut feeling is that this isn't something in my code, but some setting either on the server or in the driver that I don't know how to access or control.
I'd be really grateful if someone could either tell me how I might go about debugging the issue or directly fix it. Thank you.
Upvotes: 2
Views: 2508
Reputation: 191275
From the documentation for dbtable
:
The JDBC table that should be read from or written into. Note that when using it in the read path anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
So in your examples you could do:
dbtable = '(SELECT owner, table_name FROM ALL_TABLES)'
optionally with an alias:
dbtable = '(SELECT owner, table_name FROM ALL_TABLES) t'
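Since the only change needed is wrapping the bare SELECT in parentheses (and dropping any trailing semicolon, which JDBC rejects), this can be captured in a small helper. The function name and alias below are illustrative, not part of any API:

```python
def as_dbtable(select_stmt: str, alias: str = "t") -> str:
    """Wrap a bare SELECT so it is valid in a FROM clause, as the
    dbtable option requires; strips any trailing semicolon first."""
    return f"({select_stmt.rstrip().rstrip(';')}) {alias}"

dbtable = as_dbtable("SELECT owner, table_name FROM ALL_TABLES")
print(dbtable)  # (SELECT owner, table_name FROM ALL_TABLES) t
```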
As an alternative you could use query
instead of (not as well as) dbtable
:
A query that will be used to read data into Spark. The specified query will be parenthesized and used as a subquery in the FROM clause. Spark will also assign an alias to the subquery clause.
... so effectively the same thing, but might make your code more understandable (entirely subjective, of course), i.e. something like:
query = 'SELECT owner, table_name FROM ALL_TABLES'
and then:
jdbc_df = spark.read.format("jdbc").option("url", url) \
.option("query", query) \
.option("driver", driver) \
...
Upvotes: 5