Reputation: 6383
I have a simple Java application that can connect to and query my cluster using Hive or Impala, with code like:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
...
// Load the Cloudera Hive JDBC41 driver and open a Kerberos-authenticated
// connection to HiveServer2 on port 10000
Class.forName("com.cloudera.hive.jdbc41.HS2Driver");
Connection con = DriverManager.getConnection("jdbc:hive2://myHostIP:10000/mySchemaName;hive.execution.engine=spark;AuthMech=1;KrbRealm=myHostIP;KrbHostFQDN=myHostIP;KrbServiceName=hive");
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery("select * from foobar");
But now I want to run the same query using Spark SQL. I'm having a hard time figuring out how to use the Spark SQL API, though, specifically how to set up the connection. I see examples of how to set up the SparkSession, but it's unclear what values I need to provide. For example:
SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL basic example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate();
How do I tell Spark SQL what host and port to use, what schema to use, and which authentication technique I'm using? For example, I'm using Kerberos to authenticate.
The above Spark SQL code is from https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java
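For reference, once the session is created, what I ultimately want to run through it is the same query as in the JDBC version, something like:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
...
// Issue the query through the session instead of a JDBC Statement
Dataset<Row> rs = spark.sql("select * from foobar");
rs.show();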
UPDATE:
I was able to make a little progress, and I think I figured out how to tell the Spark SQL connection what host and port to use.
...
SparkSession spark = SparkSession
    .builder()
    .master("spark://myHostIP:10000")
    .appName("Java Spark Hive Example")
    .enableHiveSupport()
    .getOrCreate();
And I added the following dependency to my pom.xml file:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
With this update I can see that the connection is getting further, but it now appears to be failing because I'm not authenticated. I need to figure out how to authenticate using Kerberos. Here's the relevant log data:
2017-12-19 11:17:55.717 INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.util.Utils : Successfully started service 'SparkUI' on port 4040.
2017-12-19 11:17:55.717 INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.ui.SparkUI : Bound SparkUI to 0.0.0.0, and started at http://myHostIP:4040
2017-12-19 11:17:56.065 INFO 11912 --- [er-threadpool-0] s.d.c.StandaloneAppClient$ClientEndpoint : Connecting to master spark://myHostIP:10000...
2017-12-19 11:17:56.260 INFO 11912 --- [pc-connection-0] o.a.s.n.client.TransportClientFactory : Successfully created connection to myHostIP:10000 after 113 ms (0 ms spent in bootstraps)
2017-12-19 11:17:56.354 WARN 11912 --- [huffle-client-0] o.a.s.n.server.TransportChannelHandler : Exception in connection from myHostIP:10000
java.io.IOException: An existing connection was forcibly closed by the remote host
Upvotes: 10
Views: 8094
Reputation: 11
You can try doing the Kerberos login before opening the connection:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.security.UserGroupInformation;

Configuration conf = new Configuration();
conf.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
conf.addResource(pathToHdfsSite);   // Path to your hdfs-site.xml
conf.addResource(pathToCoreSite);   // Path to your core-site.xml
conf.set("hadoop.security.authentication", "kerberos");
conf.set("hadoop.rpc.protection", "privacy");
UserGroupInformation.setConfiguration(conf);
// Log in from the keytab before opening any connection
UserGroupInformation.loginUserFromKeytab(ktUserName, ktPath);
//your code here
Here ktUserName is the Kerberos principal (in the usual user@REALM form) and ktPath is the path to its keytab. You need core-site.xml, hdfs-site.xml, and the keytab on your machine to run this.
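A minimal end-to-end sketch of how this fits together with the SparkSession from the question (the paths, principal, and keytab location are placeholders to replace with your own, and it assumes your cluster's hive-site.xml is on the classpath so the session can reach the metastore):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KerberosSparkHiveExample {
    public static void main(String[] args) throws Exception {
        // 1. Kerberos login first (placeholder paths and principal)
        Configuration conf = new Configuration();
        conf.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.rpc.protection", "privacy");
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab("myUser@MY.REALM", "/path/to/my.keytab");

        // 2. Hive-enabled session; with Hive support the session talks to the
        //    metastore directly rather than to HiveServer2's JDBC port
        SparkSession spark = SparkSession
            .builder()
            .appName("Java Spark Hive Example")
            .enableHiveSupport()
            .getOrCreate();

        // 3. Same query as the JDBC version
        Dataset<Row> rs = spark.sql("select * from mySchemaName.foobar");
        rs.show();
    }
}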
Upvotes: 1
Reputation: 918
DataFrame creation using Impala with Kerberos authentication
I am able to make an Impala connection with Kerberos authentication. Check out my git repo here; maybe it will be of some help.
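The repo itself isn't reproduced here, but the general shape of building a DataFrame from Impala over JDBC looks roughly like this (a sketch assuming the Cloudera Impala JDBC41 driver is on the classpath, Impala's usual port 21050, and a Kerberos login as in the answer above; the host, realm, and schema names are placeholders):

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ImpalaJdbcExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("Impala over JDBC")
            .getOrCreate();

        // AuthMech=1 selects Kerberos in the Cloudera driver; port 21050 is
        // Impala's usual JDBC port (both are assumptions - adjust for your cluster)
        String url = "jdbc:impala://myHostIP:21050/mySchemaName;"
            + "AuthMech=1;KrbRealm=MY.REALM;KrbHostFQDN=myHostIP;KrbServiceName=impala";

        Properties props = new Properties();
        props.setProperty("driver", "com.cloudera.impala.jdbc41.Driver");

        // Each table read comes back as a DataFrame
        Dataset<Row> df = spark.read().jdbc(url, "foobar", props);
        df.show();
    }
}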
Upvotes: 0