Reputation: 184
In Spark DataFrames, I'm looking for the following under-the-hood explanation, for optimization purposes.
I'm not sure this is what actually happens under the hood; kindly validate this and correct me if I'm wrong, as it has a big impact on performance.
Thanks in advance.
Upvotes: 1
Views: 279
Reputation: 14939
Each partition's task opens its own JDBC connection.
val df = spark.read.jdbc(
  url = jdbcUrl,
  table = "employees",        // the Scala parameter is `table`; `dbtable` is the option key, not a parameter name
  columnName = "emp_no",      // numeric column the read is split on
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = 100,        // number of partitions, and therefore tasks
  connectionProperties = connectionProperties)
display(df)
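Under the hood, Spark splits the [lowerBound, upperBound) range into numPartitions equal strides and issues one query per partition. A rough sketch of that stride logic (simplified from Spark's JDBCRelation.columnPartition, for illustration only, not the actual source):

val lowerBound = 1L
val upperBound = 100000L
val numPartitions = 100
val stride = (upperBound - lowerBound) / numPartitions

// Build the WHERE clause each task appends to its query, e.g.
// SELECT * FROM employees WHERE emp_no >= 1000 AND emp_no < 1999
val predicates = (0 until numPartitions).map { i =>
  val lo = lowerBound + i * stride
  val hi = lo + stride
  if (i == 0) s"emp_no < $hi OR emp_no IS NULL"            // first partition also catches NULLs
  else if (i == numPartitions - 1) s"emp_no >= $lo"        // last partition is open-ended above
  else s"emp_no >= $lo AND emp_no < $hi"
}

Note that lowerBound and upperBound only shape these strides; rows outside the range are still read by the open-ended first and last partitions, they just skew those two partitions.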
In the Spark UI, you will see that numPartitions dictates the number of tasks that are launched. The tasks are spread across the executors, which increases the parallelism of reads and writes through the JDBC interface. See the Spark JDBC data source guide for other parameters that can help with performance, such as the fetchsize option, sketched below.
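For example, fetchsize controls how many rows the JDBC driver pulls per network round trip. A minimal sketch using the option-based reader (the fetchsize value of 10000 is an illustrative starting point, and jdbcUrl is assumed to be defined as above):

val tuned = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "employees")
  .option("partitionColumn", "emp_no")   // these four options mirror the jdbc() arguments above
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "100")
  .option("fetchsize", "10000")          // rows per round trip; the default depends on the JDBC driver
  .load()                                 // user/password options omitted for brevity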
Upvotes: 0