c2n

Reputation: 83

Read a SQL file in Python to run against an Oracle database via JDBC in Databricks

I have a SQL statement that I want to run against an Oracle database using a JDBC driver in Databricks. I can get this to run successfully if the SQL statement is quite short, for example if it just selects all of the data from a table with no filters (e.g. select * from tbl).

However, the SQL statement I actually need to execute is extremely long, so I am creating the string to pass to the JDBC driver by loading it from a .sql file saved on the Databricks file storage (DBFS).

When running this I was presented with an error. On investigation, printing the contents of the text file shows that part of the SQL statement is dropped and replaced with a message before the statement resumes:

*** WARNING: skipped 62431 bytes of output ***

Effectively it looks like this in the printed string:

    sum (
        case
            when dpr.pricing_Type in ('P', 'C') then
                nvl (
                    decode (dpr.price / 100, null, 0,
                        decode (apr.price, 'N', 

*** WARNING: skipped 62431 bytes of output ***

        then
            dpr.percentage_applied
        else 
            0
        end
    ) as price_percent,

Note that the code before the warning relates to a completely different field from the code after it.

Are there any suggestions on the cause of this and how to resolve it?

The full script I am running is below for reference. Note that the .sql file is only 113 KB, and I am using Python 3.7.5 via Databricks runtime 7.4:

%python

# setup jdbc credentials (from key vault) and url
jdbcUsername = dbutils.secrets.get(scope="USER", key="ID")
jdbcPassword = dbutils.secrets.get(scope="PWD", key="PWD")
jdbcUrl = "jdbc:oracle:thin:@<REDACTED>"
jdbcDrv = "oracle.jdbc.driver.OracleDriver"

# Table Name
OutputTbl = "db.tblCore"

# Drop table. 
spark.sql("DROP TABLE IF EXISTS " + OutputTbl  )

# parallelism
lbound = 20160101
ubound = 20210115
fsize = "1000"
colname = "date_value_yyyymmdd"
numParts = "10"

# Get SQL statement from file.
with open('/dbfs/FileStore/shared_uploads/<REDACTED>/SQL', 'r') as f:
    sql = f.read()

# Create DF and write output to a table. 
spdf = (spark.read.format("jdbc")
  .option("driver", jdbcDrv)
  .option("url", jdbcUrl)
  .option("user", jdbcUsername)
  .option("password", jdbcPassword)
  .option("dbtable", sql)
  .option("numPartitions", numParts)
  .option("fetchsize", fsize)
  .option("partitionColumn", colname) 
  .option("lowerBound", lbound)
  .option("upperBound", ubound)
  .load())

spdf.write.mode("overwrite").saveAsTable(OutputTbl)

Upvotes: 1

Views: 1821

Answers (1)

Alex Ott

Reputation: 87184

This is not an error, it's just a warning saying that the notebook output was truncated to avoid overloading the browser. You can look at the driver and executor logs via your cluster's Spark UI - there should be more information there.
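To convince yourself that only the displayed output is cut off while the string in memory is intact, a minimal check like the following sketch should do (the dump path here is just an example, not from your setup):

    # The in-memory string should still be complete even though the
    # printed output was truncated; its length should roughly match
    # the ~113 KB size of the source file.
    print(len(sql))

    # Optionally write the string back out to DBFS and inspect that
    # file directly (example path).
    with open('/dbfs/FileStore/shared_uploads/sql_debug_dump.sql', 'w') as out:
        out.write(sql)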

I would also suggest first trying to execute that statement directly against Oracle, just to check whether it works at all.
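If running it directly against Oracle isn't convenient, a rough in-notebook sanity check (just a sketch, reusing the variables from your script) is to push the same statement through the JDBC query option, which exists in Spark 2.4+ and so is available on runtime 7.4, leaving out the partitioning options so any Oracle syntax error surfaces immediately:

    # Rough sketch: run the same statement via the JDBC "query" option
    # (without the partitioning options, which "query" does not support)
    # just to see whether Oracle accepts it. The statement must not end
    # with a semicolon, since Spark wraps it in a subquery.
    test_df = (spark.read.format("jdbc")
      .option("driver", jdbcDrv)
      .option("url", jdbcUrl)
      .option("user", jdbcUsername)
      .option("password", jdbcPassword)
      .option("query", sql)
      .load())

    test_df.limit(1).show()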

Upvotes: 1
