SreeVik

Reputation: 73

Query and partitionColumn - Spark data read from Databricks JDBC endpoint

I am trying to read from a Databricks JDBC endpoint using PySpark and write the data to PostgreSQL. I am fetching 2 million records. To improve performance I have added a row_number column to my query and am using it to partition the read. Below are my code and SQL.

src_df= spark.read.format("jdbc"). \
        option("url", url). \
        option("dbtable", sqlquery). \
        option("fetchSize", 20000). \
        option("partitionColumn", "row_number"). \
        option("lowerBound", 1). \
        option("upperBound", 60000). \
        option("numPartitions", 8). \
        option("driver", driver). \
        option("user", user). \
        option("password", PAT_token).load()

I am defining the SQL in a config file that I read in the code. The SQL entry from the .ini file is as below:

sqlquery= (select
        row_number() over(order by (select null)) as row_number,
        artifact1_name,
        artifact2_name) as q1

Since I do not want the row_number column to be written into the PostgreSQL DB, I am dropping it as below.

final_df=src_df.drop("row_number")
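
For reference, the write to PostgreSQL is roughly as below (a minimal sketch; pg_url, pg_table, pg_user and pg_password are placeholders for the actual connection details, not my real values):

# Placeholder connection details; mode("append") is just an example write mode.
final_df.write.format("jdbc"). \
        option("url", pg_url). \
        option("dbtable", pg_table). \
        option("driver", "org.postgresql.Driver"). \
        option("user", pg_user). \
        option("password", pg_password). \
        mode("append"). \
        save()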

However, the problem is that src_df.count() returns 0. I am not sure whether this is an issue with the Databricks JDBC endpoint or with the way I have defined the query in the .ini file. The code below, without any partitioning, runs fine against the same Databricks endpoint.

src_df = spark.read.format("jdbc").\
         option("url", url).\
         option("query", sqlquery).\
         option("fetchSize", 30000).\
         option("driver", driver).\
         option("user", user).\
         option("password", pwd).load()

My understanding is that using partitionColumn, upperBound and lowerBound does not filter any data. Any help or pointers in this regard are appreciated. Thank you.

Upvotes: 2

Views: 1023

Answers (1)

matkurek

Reputation: 781

The sqlquery you're using generates the row_number column dynamically. For the parallelized read to work properly, partitionColumn needs to be a column on a physical table: Spark generates a set of queries with WHERE clauses on that column, covering the provided range of values.
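
For example, with lowerBound=1, upperBound=60000 and numPartitions=8, the eight partition queries carry WHERE clauses roughly like the following (note that the first and last ranges are open-ended, so rows outside the bounds are still read, just not split further):

row_number < 7501 or row_number is null
row_number >= 7501 AND row_number < 15001
...
row_number >= 52501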

If you add that column to the physical table, or use another existing numeric/date column, it should work.
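
A minimal sketch of that approach, assuming a hypothetical source table my_table with an existing numeric key column id, and reusing the url, driver, user and PAT_token variables from the question:

# "my_table" and "id" are hypothetical names; use your real table and numeric column.
src_df = spark.read.format("jdbc"). \
        option("url", url). \
        option("dbtable", "my_table"). \
        option("partitionColumn", "id"). \
        option("lowerBound", 1). \
        option("upperBound", 2000000). \
        option("numPartitions", 8). \
        option("fetchSize", 20000). \
        option("driver", driver). \
        option("user", user). \
        option("password", PAT_token).load()

lowerBound and upperBound should roughly match the min/max of the chosen column; they only decide how the ranges are split between partitions and do not filter rows.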

Upvotes: 0
