Reputation: 1142
I can read data from an Oracle database on the Master node using this code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", 4)
  .config("spark.task.cpus", 1)
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
.option("dbtable", "table")
.option("user", "orcl")
.option("password", "********")
.load()
Then I can repartition the DataFrame among the Workers:
import org.apache.spark.sql.functions.col

val test = jdbcDF.repartition(8, col("ID_Col"))
test.explain
My issue is that my data is huge and cannot fit in the Master node's RAM. As a result, I want each node to read its own portion of the data separately. I am wondering whether there is any way to read data from the database on every Worker and load it into a Spark DataFrame, i.e. to load the data into a Spark DataFrame on each Worker node separately, using Scala or Python.
Could you please guide me on how to do that?
Any help is really appreciated.
Upvotes: 2
Views: 1291
Reputation: 18108
With local mode you do not have a resource manager like YARN, and you have no Workers; but you can still run things in parallel on a single machine, provided local[n] is set suitably for the N cores available.
You will not be loading everything onto the Master if you follow Alex Ott's advice and read up on this.
You can improve loading speed by using the parameters partitionColumn, lowerBound, upperBound and numPartitions when reading data with spark.read.jdbc, which spreads the read across the cores, instead of Executors on Workers. That is what local means and how Spark works.
If you need a different partitioning afterwards, you then have to do a subsequent repartition.
It will be slower this way, but provided you have enough memory and disk, it will process.
Upvotes: 1