Reputation: 2083
I am trying to read a table from a Postgres database and load the data into a Hive table on HDFS, as below:
val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "source_system_name")
  .option("lowerBound", 10000)
  .option("upperBound", 50000)
  .option("numPartitions", 12)
  .load()
My job is failing with the message "Container exited with a non-zero exit code 143" along with a GC overhead error.
The data is huge and it is causing skew across the partitions.
After observing the data in the table, I understood that 'source_system_name' is not the right column to partition the data on. However, I found that a combination of three columns, source_system_name, org_code and period,
is best suited to split the data while reading and processing it. (I ran a 'group by' along the lines of the sketch below to find the row count per group, and the data is spread fairly evenly across the groups.)
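The check was roughly like this (a sketch only; the table name is a placeholder and the connection variables are the same ones used in the read above):

import org.apache.spark.sql.functions.desc

// Rough sketch of the skew check: count rows per candidate partition key.
// "schema.year2017_table" is a placeholder, not the real table name.
val counts = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "schema.year2017_table")
  .option("user", devUserName)
  .option("password", devPassword)
  .load()
  .groupBy("source_system_name", "org_code", "period")
  .count()
  .orderBy(desc("count"))

counts.show(50, truncate = false)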
But I don't know how I can use all three columns in spark-jdbc's partitionColumn
option. Could anyone let me know whether it is possible to use multiple columns in the partitionColumn option while reading data from an RDBMS table?
Upvotes: 0
Views: 1475
Reputation: 18098
No, the following applies:
For partitionColumn, lowerBound, upperBound - these options must all be specified if any of them is specified. In addition, numPartitions must be specified.
They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. The latter implies it can only be one column.
lowerBound and upperBound are just used to decide the partition stride, not to filter the rows in the table. So all rows in the table will be partitioned and returned.
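For illustration, a valid configuration looks like the sketch below; "period_id" is an assumed numeric column, not one from the asker's actual table:

// Minimal sketch, assuming a numeric column named "period_id" exists in the
// pushed-down query. partitionColumn must be exactly one numeric column.
val partitionedDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "period_id") // one numeric column only
  .option("lowerBound", 10000)            // bounds only set the stride
  .option("upperBound", 50000)            // rows outside the bounds are still read
  .option("numPartitions", 12)
  .load()

// Spark splits the range into numPartitions strides of roughly
// (upperBound - lowerBound) / numPartitions and issues one query per stride:
//   WHERE period_id < b1 OR period_id IS NULL
//   WHERE period_id >= b1 AND period_id < b2
//   ...
//   WHERE period_id >= b11
// The first and last clauses are open-ended, so every row is returned.

If only string columns are good candidates, one common workaround (not covered by the quoted docs) is to compute a numeric bucket, e.g. a hash modulo numPartitions, inside the pushed-down subquery and use that derived column as the partitionColumn.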
Upvotes: 3