Reputation: 137
I'm using Spark Streaming to analyze tweets in a sliding window. Since I don't want to save all the data but only the current contents of the window, I want to query the data directly from memory.
My problem is pretty much identical to this one:
How to Access RDD Tables via Spark SQL as a JDBC Distributed Query Engine?
This is the important part of my code:
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

sentimentedWords.foreachRDD { rdd =>
  val hiveContext = new HiveContext(SparkContext.getOrCreate())
  import hiveContext.implicits._
  val dataFrame = rdd.toDF("sentiment", "tweet")
  dataFrame.registerTempTable("tweets")
  HiveThriftServer2.startWithContext(hiveContext)
}
As I found out, the line HiveThriftServer2.startWithContext(hiveContext) starts a new Thrift server that should provide access to the temp table via JDBC. However, I get the following exception in my console:
org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:10000.
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:93)
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:79)
at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236)
at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69)
at java.lang.Thread.run(Thread.java:745)
As I'm using the Hortonworks Data Platform (HDP), port 10000 is already in use by the default Hive Thrift server! I logged into Ambari and changed the ports as follows:
<property>
<name>hive.server2.thrift.http.port</name>
<value>12345</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>12345</value>
</property>
But this made it worse. Now Ambari shows that it can't start the service due to a ConnectionRefused error. Other ports like 10001 don't work either, and port 10000 is still in use after restarting Hive.
I assume that if I can use port 10000 for my Spark application's Thrift server and move the default Hive Thrift server to some other port, everything should be fine. Alternatively, I could tell my application to start its Thrift server on a different port, but I don't know if that's possible.
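For the second option, something like this is what I had in mind, but I'm not sure whether startWithContext actually picks up hive.server2.thrift.port from the context's configuration, so treat it as an untested sketch:
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(SparkContext.getOrCreate())
// Untested idea: ask the embedded Thrift server to bind to a free port
// instead of the default 10000 that HDP's HiveServer2 already occupies.
hiveContext.setConf("hive.server2.thrift.port", "10001")
HiveThriftServer2.startWithContext(hiveContext)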
Any ideas?
Additional comment: Killing the service listening on port 10000 has no effect.
Upvotes: 1
Views: 411
Reputation: 137
I finally fixed the problem as follows:
As I'm using Spark Streaming, my job runs in an infinite loop, and inside that loop I had the line that starts the Thrift server:
HiveThriftServer2.startWithContext(hiveContext)
This resulted in my console being spammed with "Could not create ServerSocket" messages. I had overlooked that my code was working fine and that I was just accidentally trying to start multiple servers... awkward.
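One way to avoid the repeated start attempts is to pull startWithContext out of the loop so the server is started exactly once; roughly like this (a sketch based on the snippet from my question, where sentimentedWords is the DStream from above):
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(SparkContext.getOrCreate())
import hiveContext.implicits._

// Start the Thrift server exactly once, before the streaming loop.
HiveThriftServer2.startWithContext(hiveContext)

sentimentedWords.foreachRDD { rdd =>
  // Only refresh the temp table per micro-batch; do NOT start another
  // server here, or every batch tries to bind the port again.
  val dataFrame = rdd.toDF("sentiment", "tweet")
  dataFrame.registerTempTable("tweets")
}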
What's also important to mention: if you are using Hortonworks HDP, do not use the beeline command that is on your PATH. Start the "correct" beeline that can be found under $SPARK_HOME/bin/beeline. This took me hours to find out! I don't know what's wrong with the regular beeline, and at this point I honestly don't care anymore...
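If you prefer to verify the connection from code rather than beeline, a minimal JDBC sketch looks something like this (assuming the Thrift server is on the default localhost:10000 and the Hive JDBC driver is on the classpath; host, port and credentials are placeholders for your setup):
import java.sql.DriverManager

// Assumes the Hive JDBC driver is available; adjust URL and credentials as needed.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val connection = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
val statement = connection.createStatement()
val resultSet = statement.executeQuery("SELECT sentiment, tweet FROM tweets LIMIT 10")
while (resultSet.next()) {
  println(s"${resultSet.getString(1)}\t${resultSet.getString(2)}")
}
resultSet.close()
statement.close()
connection.close()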
Besides that: after a restart of my HDP Sandbox, the ConnectionRefused issue in Ambari was also gone.
Upvotes: 2