Reputation: 3128
I have a Spark job that moves data from Postgres to Redshift on a regular basis. I'm using the spark.read.jdbc function with the lowerBound and upperBound params:
df = spark.read.jdbc(url=jdbc_url,
                     table='some_table',
                     column='id',
                     lowerBound=1,
                     upperBound=20000000,
                     numPartitions=50)
At the moment upperBound is hardcoded, but the table grows every day, so I need to update the upperBound value dynamically to reflect the size of the table at the start of each job run. How can I make the upperBound value equal to the current size of the table?
Upvotes: 0
Views: 325
Reputation: 829
You can fetch the bound values before you execute the main query and then use them:
query = "(SELECT min({0}), max({0}) FROM {1}) AS temp".format(
    partition_column, table
)

(lower_bound, upper_bound) = (spark.read
                              .jdbc(url=url, table=query, properties=properties)
                              .first())
df = spark.read.jdbc(url=jdbc_url,
                     table='some_table',
                     column='id',
                     lowerBound=1,
                     upperBound=upper_bound + 10,
                     numPartitions=50)
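For completeness, a minimal end-to-end sketch of the same idea, assuming the jdbc_url, properties dict, and table/column names from the question (adjust to your setup); it also guards against an empty table by falling back to an unpartitioned read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("postgres_to_redshift").getOrCreate()

table = "some_table"           # assumed table name from the question
partition_column = "id"

# Fetch the current min/max of the partition column with one small query.
bounds_query = "(SELECT min({0}) AS min_id, max({0}) AS max_id FROM {1}) AS bounds".format(
    partition_column, table
)
row = spark.read.jdbc(url=jdbc_url, table=bounds_query, properties=properties).first()

if row is None or row["max_id"] is None:
    # Empty table: read without partitioning.
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=properties)
else:
    # Note: lowerBound/upperBound only control how the id range is split into
    # partitions; rows outside the range are still read.
    df = spark.read.jdbc(url=jdbc_url,
                         table=table,
                         column=partition_column,
                         lowerBound=row["min_id"],
                         upperBound=row["max_id"] + 1,
                         numPartitions=50,
                         properties=properties)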
Upvotes: 1