Metadata

Reputation: 2083

Is it possible to specify multiple columns in partitionColumn while working with Spark-JDBC?

I am trying to read a table from a Postgres database and load the data into a Hive table on HDFS, as below:

val yearDF = spark.read.format("jdbc").option("url", connectionUrl).option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName).option("password", devPassword)
.option("partitionColumn","source_system_name")
.option("lowerBound", 10000).option("upperBound", 50000)
.option("numPartitions",12).load()

My job is failing with the message "Container exited with a non-zero exit code 143" along with a GC overhead limit exceeded error.

The data is huge and the partitions are skewed. After looking at the data in the table, I understood that source_system_name is not the right column to partition on. But I have found that a combination of three columns: source_system_name, org_code, period is best suited to split the data while reading and processing. (I did a group by to find the row count per group, and the data is spread fairly evenly across the groups.) However, I don't know how I can use all three columns in spark-jdbc's partitionColumn option. Could anyone let me know whether it is possible to use multiple columns in the partitionColumn option while reading data from an RDBMS table?
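For reference, the skew check mentioned above can be done with a group-by count on the candidate columns. A minimal sketch, assuming yearDF has been loaded as shown and using the column names from the question:

import org.apache.spark.sql.functions.desc

// Row counts per candidate partition key; the largest groups reveal skew.
yearDF
  .groupBy("source_system_name", "org_code", "period")
  .count()
  .orderBy(desc("count"))
  .show(50, truncate = false)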

Upvotes: 0

Views: 1475

Answers (1)

Ged

Reputation: 18098

No, the following applies:

  • For partitionColumn, lowerBound, upperBound - these options must all be specified if any of them is specified. In addition, numPartitions must be specified.

  • They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. The latter implies that only a single column can be used.

  • lowerBound and upperBound are just used to decide the partition stride, not to filter the rows in the table. So all rows in the table will be partitioned and returned (see the sketch below).
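As a minimal sketch of the behaviour described above (pasteable into spark-shell): the first read reproduces the question's single numeric partitionColumn, where Spark turns lowerBound/upperBound/numPartitions into per-partition WHERE clauses on that one column; the second uses DataFrameReader.jdbc's overload that takes an explicit array of predicate strings, one partition per predicate, which is the usual way to split a JDBC read on several columns. The connection URL, table subquery, and predicate values below are hypothetical placeholders, not from the question.

import java.util.Properties

// Hypothetical connection details (placeholders).
val url = "jdbc:postgresql://dbhost:5432/mydb"
val props = new Properties()
props.setProperty("user", "devUserName")
props.setProperty("password", "devPassword")

// 1) Single numeric partitionColumn, as in the question. With
//    lowerBound = 10000, upperBound = 50000, numPartitions = 12 the stride is
//    (50000 - 10000) / 12 ≈ 3333, and Spark issues one query per stride, e.g.
//      WHERE source_system_name < 13333 OR source_system_name IS NULL
//      WHERE source_system_name >= 13333 AND source_system_name < 16666
//      ...
//      WHERE source_system_name >= 46663
//    The bounds only set the stride; rows outside them still land in the
//    first and last partitions.
val byColumn = spark.read.jdbc(
  url,
  "(select * from some_table) as year2017",   // placeholder subquery
  "source_system_name",                       // a single numeric column
  10000L, 50000L, 12,
  props)

// 2) Explicit predicates: each string becomes the WHERE clause of exactly one
//    partition, so any combination of columns can be used to split the read.
val predicates = Array(
  "source_system_name = 'SYS_A' AND org_code = 'ORG1' AND period = 201701",
  "source_system_name = 'SYS_A' AND org_code = 'ORG1' AND period = 201702",
  "source_system_name = 'SYS_B' AND org_code = 'ORG2' AND period = 201701"
)
val byPredicates = spark.read.jdbc(
  url,
  "(select * from some_table) as year2017",
  predicates,
  props)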

Upvotes: 3
