Aru

Reputation: 184

Does a DataFrame connect to the RDBMS directly from the executors, or does it go through the driver?

For Spark DataFrames, I'm looking for the following under-the-hood explanation, for optimization purposes.

  1. DataFrames are a special type of RDD; internally they are backed by an RDD of Row objects (RDD[Row]). These Row RDDs are spread across the executors.
  2. When we write these Row RDDs (especially while running in YARN-client mode), the rows are transferred from the executors to the driver, and the driver writes into Oracle over its JDBC connection. (Is this true?)
  3. When the same code runs in YARN-cluster mode, the Row RDDs are written to Oracle directly from the executors. This could be faster, but the number of available JDBC connections could limit or slow down the process. (A sketch of the kind of write I mean follows this list.)
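For reference, this is the kind of write I'm talking about (a minimal sketch; the Oracle URL, table name, and credentials are placeholders, not my real setup):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "scott")        // placeholder credentials
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // Hypothetical Oracle URL and target table; df is an existing DataFrame.
    df.write
      .mode("append")
      .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1", "TARGET_TABLE", props)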

I'm not sure this is what happens under the hood; kindly validate this and correct me if I'm wrong. It has a big impact on performance.

Thanks in advance.

Upvotes: 1

Views: 279

Answers (1)

Tagar

Reputation: 14939

Each executor makes its own connection.

// Each of the numPartitions tasks reads its own slice of emp_no over its own JDBC connection.
val df = spark.read.jdbc(url = jdbcUrl,
    table = "employees",
    columnName = "emp_no",      // column used to split the reads into partitions
    lowerBound = 1L,
    upperBound = 100000L,
    numPartitions = 100,
    connectionProperties = connectionProperties)
display(df)                     // Databricks notebook helper; use df.show() elsewhere

In the Spark UI, you will see that numPartitions dictates the number of tasks that are launched. These tasks are distributed across the executors, which can increase the parallelism of reads and writes through the JDBC interface. See the upstream guide for other parameters that can help with performance, such as the fetchsize option.
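The same holds on the write path: each partition is written by the task that holds it, on its executor, not by the driver. A minimal sketch, assuming the jdbcUrl and connectionProperties from above; the target table name, partition count, and batchsize value are illustrative:

    // Each of the 10 partitions is written by its own task, on an executor,
    // over that task's own JDBC connection.
    df.repartition(10)
      .write
      .mode("append")
      .option("batchsize", 10000)   // rows per JDBC batch insert (illustrative value)
      .jdbc(jdbcUrl, "employees_copy", connectionProperties)

Controlling the number of partitions before the write is how you trade off write parallelism against the number of simultaneous connections the database can handle.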

Upvotes: 0
