Reputation: 257
I am trying to connect to Presto DB through my Spark Scala code and running it on an EMR Cluster. I am able to create the RDD but when the worker nodes are trying to fetch the data the code fails saying the file not found (keystore not exists) though it is present in the master node. Is there a way I can copy the Keystore file to child nodes? Below is my code and steps I am following
First step I copy the certificate to tmp folder using below command
s3-dist-cp --src s3://test/rootca_ca.jks --dest /tmp/
Then I run the below code with the following command
spark-submit --executor-memory=10G --driver-memory=10G --executor-cores=2 --jars s3://test1/jars/presto-jdbc-338-e.0.jar --class com.asurion.prestotest --master yarn s3://test1/script/prestotest.jar
package com.asurion
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import java.time.LocalDateTime
import java.util.concurrent._
object prestotest {
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val conf = new SparkConf().setAppName("testapp")
val sc = new SparkContext(conf);
val sqlcontext = SparkSession.builder().getOrCreate()
val carrier_info =" select * from test_tbl "
val enrdata = sqlcontext.read.format("jdbc").option("url", "jdbc:presto://test.atlas.prd.aws.test.net:18443/hive").option("SSL","true").option("SSLTrustStorePath","/tmp/rootca_ca.jks").option("SSLTrustStorePassword","pass1").option("query", carrier_info).option("user", "user1").option("password", "pass2").option("driver", "io.prestosql.jdbc.PrestoDriver").load()
println("Writing Statistics" )
enrdata.show(5)
println("Writing done" )
}
}
Error:
scheduler.TaskSetManager (Logging.scala:logWarning(66)): Lost task 0.0 in stage 0.0 (TID 0, 100.64.187.253, executor 1): java.sql.SQLException: Error setting up SSL: /tmp/rootca_ca.jks (No such file or directory)
at io.prestosql.jdbc.PrestoDriverUri.setupClient(PrestoDriverUri.java:235)
at io.prestosql.jdbc.PrestoDriver.connect(PrestoDriver.java:88)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.prestosql.jdbc.$internal.client.ClientException: Error setting up SSL: /tmp/rootca_ca.jks (No such file or directory)
at io.prestosql.jdbc.$internal.client.OkHttpUtil.setupSsl(OkHttpUtil.java:241)
at io.prestosql.jdbc.PrestoDriverUri.setupClient(PrestoDriverUri.java:203)
... 23 more
Caused by: java.io.FileNotFoundException: /tmp/rootca_ca.jks (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at io.prestosql.jdbc.$internal.client.OkHttpUtil.loadTrustStore(OkHttpUtil.java:308)
at io.prestosql.jdbc.$internal.client.OkHttpUtil.setupSsl(OkHttpUtil.java:220)
Upvotes: 1
Views: 1469
Reputation: 1410
Spark on EMR create the driver in one of of the CORE node (by default) not on master code.
A driver (On CORE node) can't access files in master node.
So what option do you have -
rootca_ca.jks
file (s3 cp
) to every worker (CORE & TASK) node and you don't to change anything in the programs3-dist-cp
to copy the file, it puts your file in HDFS not in linux file system.hdfs:///tmp/rootca_ca.jks
.s3://test/rootca_ca.jks
.Option 1: is very hard to maintain over time, if your file needs to be changed on running cluster you will have to do it manually on every worker node.
Option 2: as HDFS is shared file system, you can need to maintain the file in one place.
Option 3: there is not much difference as option 2, just that your file stays and read from S3. You don't have to copy the file in HDFS.
Upvotes: 1