Reputation: 1142
I can read data from an Oracle database on the Master node using this code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", 4)
  .config("spark.task.cpus", 1)
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:@x.x.x.x:1521:orcldb")
.option("dbtable", "table")
.option("user", "orcl")
.option("password", "********")
.load()
Then I can repartition the DataFrame among the Workers:
import org.apache.spark.sql.functions.col

val test = jdbcDF.repartition(8, col("ID_Col"))
test.explain
My issue is that my data is huge and cannot fit in the Master node's RAM. As a result, I want each node to read its own portion of the data separately. I am wondering whether there is any way to read data from the database on every Worker and load it into a Spark DataFrame, i.e. to load the data into a Spark DataFrame on each Worker node separately, using Scala or Python.
Could you please guide me on how to do that?
Any help is really appreciated.
Upvotes: 2
Views: 1291
Reputation: 18108
With local mode you do not have a resource manager like YARN, and you have no Workers; but you can still run things in parallel on a single machine, provided local[n] is set suitably for the N cores available.
You will not be loading everything onto the Master if you follow Alex Ott's advice and read up on this.
You can improve loading speed by using the parameters partitionColumn, lowerBound, upperBound and numPartitions when reading data with spark.read.jdbc, which spreads the read across the cores, instead of Executors on Workers. That is what local means and how Spark works.
If you need a different partitioning afterwards, you then have to do a subsequent repartition.
It will be slower this way, but provided you have enough memory and disk, it will process.
Upvotes: 1