M_Gh

Reputation: 1142

How to load data into a Spark DataFrame in each Worker to avoid loading huge data onto the Master node

I can read data from an Oracle database on the Master node using this code:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder
      .master("local[4]")
      .config("spark.executor.memory", "8g")
      .config("spark.executor.cores", 4)
      .config("spark.task.cpus", 1)
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

 val jdbcDF = spark.read
              .format("jdbc")
              .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
              .option("dbtable", "table")
              .option("user", "orcl")
              .option("password", "********")
              .load()

Then I can repartition the DataFrame among the Workers:

    import org.apache.spark.sql.functions.col

    val test = jdbcDF.repartition(8, col("ID_Col"))
    test.explain

My issue is that my data is huge and it cannot fit in the Master's RAM. As a result, I want each node to read its own data separately. I am wondering if there is any way to read data from the database on every Worker and load it into a Spark DataFrame. In fact, I want to load data into a Spark DataFrame on each Worker node separately, using Scala or Python.

Would you please guide me on how I can do that?

Any help is really appreciated.

Upvotes: 2

Views: 1291

Answers (1)

Ged

Reputation: 18108

With local mode you do not have a resource manager like YARN. You have no Workers, but you can still run tasks in parallel on the same machine with N cores, provided local[n] is set suitably.

You will not be loading everything onto the Master if you follow the advice of Alex Ott and read up.

You can improve loading speed by using the parameters partitionColumn, lowerBound, upperBound and numPartitions when reading data with spark.read.jdbc; the parallelism then comes from the cores on your machine rather than from Executors on Workers. That is what local means and how Spark works.
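For example, a minimal sketch of such a partitioned read, reusing the connection details and the ID_Col column from the question; the lowerBound and upperBound values here are assumptions and should roughly match the actual minimum and maximum of that column:

    val partitionedDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
      .option("dbtable", "table")
      .option("user", "orcl")
      .option("password", "********")
      .option("partitionColumn", "ID_Col") // must be a numeric, date or timestamp column
      .option("lowerBound", "1")           // assumed minimum of ID_Col
      .option("upperBound", "1000000")     // assumed maximum of ID_Col
      .option("numPartitions", "8")        // 8 parallel range queries
      .load()

Each of the 8 partitions then issues its own range query against Oracle, so the rows are fetched in parallel by the tasks instead of through a single connection.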

If you need a different partitioning, you then need to do a subsequent repartition.

If you have enough memory and disk, it will be slower, but it will process.

Upvotes: 1
