Reputation: 442
I'm having issues connecting from an AWS EMR cluster running Spark to another AWS EMR cluster running Presto.
The code, written in Python, is:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver") \
    .option("url", "jdbc:presto://ec2-xxxxxxxxxxxx.ap-southeast-2.compute.amazonaws.com:8889/hive/data-lake") \
    .option("user", "hadoop") \
    .option("dbtable", "customer") \
    .load()
This is deployed via aws emr add-steps with the option
with the option --packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.60\',\
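For reference, the full step submission looks roughly like the sketch below (shown here via boto3 rather than the aws CLI); the cluster ID, step name, and S3 script path are placeholders I've substituted, not the real values:

import boto3

emr = boto3.client('emr', region_name='ap-southeast-2')
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXX',        # placeholder cluster id
    Steps=[{
        'Name': 'presto-jdbc-read',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                '--packages',
                'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,'
                'org.postgresql:postgresql:42.2.9,'
                'com.facebook.presto:presto-jdbc:0.60',
                's3://my-bucket/jobs/read_from_presto.py',   # placeholder script path
            ],
        },
    }],
)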
Which, when deployed, throws the following error:
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:237)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:330)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:250)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to ip-xxxx-xxx.ap-southeast-2.compute.internal/xxx-xxxx:41885
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ip-xxxxxxxxx.ap-southeast-2.compute.internal/xxxxxx:41885
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more
Caused by: java.net.ConnectException: Connection refused
    ... 11 more
End of LogType:stderr
While I've redacted the IP address above (safety first), it is the internal IP address of the Spark slave instance itself; the executor appears to be trying to connect to itself and hitting a connection issue.
I've opened up the ports in the AWS EC2 security groups, allowing access to the Presto instance from both the Spark master and slave.
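For completeness, the rule I added is roughly equivalent to this boto3 sketch; the security group IDs are placeholders for the actual Presto and Spark cluster groups:

import boto3

ec2 = boto3.client('ec2', region_name='ap-southeast-2')
# Allow the Spark cluster's security groups to reach the Presto coordinator on 8889.
ec2.authorize_security_group_ingress(
    GroupId='sg-presto-master',               # placeholder: Presto master security group
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 8889,
        'ToPort': 8889,
        'UserIdGroupPairs': [
            {'GroupId': 'sg-spark-master'},   # placeholder: Spark master group
            {'GroupId': 'sg-spark-slave'},    # placeholder: Spark slave group
        ],
    }],
)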
If it helps, a quick Node.js script written to test connectivity works:
var presto = require('presto-client');  // assumes the presto-client npm package
var prestoEndpoint = 'ec2-xxxxxxxxxxxx.ap-southeast-2.compute.amazonaws.com';  // redacted Presto endpoint

var client = new presto.Client({
  host: prestoEndpoint,
  user: 'hadoop',
  port: 8889,
});

client.execute({
  query: 'select * from customer',
  catalog: 'hive',
  schema: 'data-lake',
  source: 'nodejs-client',
  state: function(error, query_id, stats) {
    console.log({ message: 'status changed', id: query_id, stats: stats });
  },
  columns: function(error, data) {
    console.log({ resultColumns: data });
  },
  data: function(error, data, columns, stats) {
    console.log({ data, columns });
  },
  success: function(error, stats) {
    console.log(error);
    console.log(JSON.stringify(stats, null, 2));
  },
  error: function(error) {
    console.log(error);
  },
});
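A roughly equivalent check could also be run from Python (this sketch assumes the presto-python-client package; the endpoint is redacted):

import prestodb

conn = prestodb.dbapi.connect(
    host='ec2-xxxxxxxxxxxx.ap-southeast-2.compute.amazonaws.com',  # redacted Presto endpoint
    port=8889,
    user='hadoop',
    catalog='hive',
    schema='data-lake',
)
cur = conn.cursor()
cur.execute('select * from customer limit 10')
print(cur.fetchall())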
The key part of the error message seems to be:
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ip-xxxxxxxxx.ap-southeast-2.compute.internal/xxxxxx:41885
Upvotes: 1
Views: 1167
Reputation: 442
The problem was the version number of the presto-jdbc driver.
I updated it from com.facebook.presto:presto-jdbc:0.60
to
com.facebook.presto:presto-jdbc:0.225
so the full packages parameter is
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.225\',\
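If you build the SparkSession yourself rather than passing --packages to spark-submit, the same updated coordinate can also be pulled in via the spark.jars.packages config. A minimal sketch (it only takes effect if set before the session/JVM starts):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('presto-jdbc-read')
    # Same Maven coordinate as in --packages, just set through Spark config.
    .config('spark.jars.packages', 'com.facebook.presto:presto-jdbc:0.225')
    .getOrCreate()
)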
Thanks to @Lamanus for spotting that one.
Upvotes: 0