Reputation: 3128
I have a Spark job that moves data from Postgres to Redshift on a regular basis. I'm using the spark.read.jdbc function with the lowerBound and upperBound params:
df = spark.read.jdbc(url=jdbc_url,
                     table='some_table',
                     column='id',
                     lowerBound=1,
                     upperBound=20000000,
                     numPartitions=50)
At the moment upperBound is hardcoded, but the table grows every day, so I need to update the upperBound value dynamically to reflect the size of the table at the start of each job run. How can I make the upperBound value equal to the current size of the table?
Upvotes: 0
Views: 325
Reputation: 829
You can fetch the bound values before you execute the main query and then use them:
query = "(SELECT min({0}), max({0}) FROM {1}) AS temp".format(
    partition_column, table
)

(lower_bound, upper_bound) = (spark.read
                              .jdbc(url=url, table=query, properties=properties)
                              .first())
df = spark.read.jdbc(url=jdbc_url,
                     table='some_table',
                     column='id',
                     lowerBound=1,
                     upperBound=upper_bound + 10,
                     numPartitions=50)
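For completeness, a minimal end-to-end sketch of the same idea, assuming the jdbc_url, properties dict, and table/column names from the question (adjust to your setup); it also guards against an empty table by falling back to an unpartitioned read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("postgres_to_redshift").getOrCreate()

table = "some_table"           # assumed table name from the question
partition_column = "id"

# Fetch the current min/max of the partition column with one small query.
bounds_query = "(SELECT min({0}) AS min_id, max({0}) AS max_id FROM {1}) AS bounds".format(
    partition_column, table
)
row = spark.read.jdbc(url=jdbc_url, table=bounds_query, properties=properties).first()

if row is None or row["max_id"] is None:
    # Empty table: read without partitioning.
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=properties)
else:
    # Note: lowerBound/upperBound only control how the id range is split into
    # partitions; rows outside the range are still read.
    df = spark.read.jdbc(url=jdbc_url,
                         table=table,
                         column=partition_column,
                         lowerBound=row["min_id"],
                         upperBound=row["max_id"] + 1,
                         numPartitions=50,
                         properties=properties)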
Upvotes: 1