Randomize
Randomize

Reputation: 9103

Spark on Glue is not able to connect to AWS/ElasticSearch

I am running Spark inside Glue to write down to AWS/ElasticSearch with the following configuration for Spark:

  conf.set("es.nodes", s"$nodes/$indexName")
  conf.set("es.port", "443")
  conf.set("es.batch.write.retry.count", "200")
  conf.set("es.batch.size.bytes", "512kb")
  conf.set("es.batch.size.entries", "500")
  conf.set("es.index.auto.create", "false")
  conf.set("es.nodes.wan.only", "true")
  conf.set("es.net.ssl", "true")

however what I get is the following error:

diagnostics: User class threw exception: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
    at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
    ....

I know in which "VPC" is running my ElasticSearch instance, but I am not sure how to set that for Glue/Spark or if it is a different problem. Any idea?

I have also tried to add a "glue jdbc" connection which should use the proper VPC connection but I am not sure how set it up properly:

  import scala.reflect.runtime.universe._
  def saveToEs[T <: Product : TypeTag](index: String, data: RDD[T]) =
    SparkProvider.glueContext.getJDBCSink(
      catalogConnection = "my-elasticsearch-connection",
      options = JsonOptions(
        "WHAT HERE?"
      ),
      transformationContext = "SinkToElasticSearch"
    ).writeDynamicFrame(DynamicFrame(
      SparkProvider.sqlContext.createDataFrame[T](data),
      SparkProvider.glueContext))

Upvotes: 0

Views: 1477

Answers (1)

Eman
Eman

Reputation: 851

Try to create to create a dummy JDBC connection. The dummy connection will tell Glue the ES - VPC, subnet and security group. A test connection might not work but when you run your job with the connection, it will use the connection metadata to launch elastic network interface in your VPC to facilitate this communication. More on connections can be found here:

[1] https://docs.aws.amazon.com/glue/latest/dg/start-connecting.html

Upvotes: 1

Related Questions