Reputation: 73
I am trying to read from a databricks JDBC end point using PySpark and write it to PostgreSQL. I am fetching 2 million records. To improve performance I have added a row_number column to my query and using it to partition the read. Below is my code and sql.
src_df= spark.read.format("jdbc"). \
option("url", url). \
option("dbtable", sqlquery). \
option("fetchSize", 20000). \
option("partitionColumn", "row_number"). \
option("lowerBound", 1). \
option("upperBound", 60000). \
option("numPartitions", 8). \
option("driver", driver). \
option("user", user). \
option("password", PAT_token).load()
I am defining the sql in a config file that I am retrieving in the code. sql entry from the .ini is as below
sqlquery= (select
row_number() over(order by (select null)) as row_number,
artifact1_name,
artifact2_name) as q1
Since I do not want the row_number to be written into the PosgreSQL DB I am dropping it as below.
final_df=src_df.drop("row_number")
However the problem is that the src_df.count is coming as 0. I am not sure if this is the issue with databricks JDBC or the way I have defined the query in the .ini file. The code without any partitioning works fine against the same databricks end point. The below works fine without any issues.
src_df = spark.read.format("jdbc").\
option("url", url).\
option("query", sqlquery).\
option("fetchSize", 30000).\
option("driver", driver).\
option("user", user).\
option("password", pwd).load()
My understanding is that using partitionColumn, upperbound and lowerbound does not filter any data. Any help or pointers in this regard is appreciated. Thank you.
Upvotes: 2
Views: 1023
Reputation: 781
The sqlquery
you're using is generating the row_number
column dynamically. For the parallelized reading to work properly partitionColumn
needs to be a column on a physical table (spark generates a bunch of queries with where statements on that column using provided ranges of values).
If you add that column to the physical table, or use other existing numeric/date column, it should work.
Upvotes: 0