cybergoof

Reputation: 1437

PySpark + ElasticSearch: Reading multiple index/type

I'm running PySpark against an Elasticsearch backend using the elasticsearch-hadoop connector. I can read from a desired index using:

    from pyspark import SparkConf, SparkContext

    es_read_conf = {
        "es.nodes": "127.0.0.1",
        "es.port": "9200",
        "es.resource": "myIndex_*/myType"
    }
    conf = SparkConf().setAppName("devproj")
    sc = SparkContext(conf=conf)

    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_read_conf
    )

This works fine, and I can wildcard the index.

How do I wildcard the document "type"? Or, how could I get more than one type, or even _all?

Upvotes: 0

Views: 1065

Answers (1)

Andrei Stefan

Reputation: 52368

For all types you can use "es.resource": "myIndex_*".
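In the read conf from the question, that just means dropping the type from the resource, roughly:

    es_read_conf = {
        "es.nodes": "127.0.0.1",
        "es.port": "9200",
        "es.resource": "myIndex_*"  # no type, so all types in the matching indices are read
    }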

For the wildcard-on-type part you would need a query, for example a prefix query on the type:

    {
      "query": {
        "prefix": {
          "_type": {
            "value": "test"
          }
        }
      }
    }
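
Hooked into the PySpark read conf from the question, a minimal sketch (assuming elasticsearch-hadoop's "es.query" setting and an illustrative type prefix "test") would be:

    # Same read conf as in the question, with the type removed from
    # "es.resource" and the prefix query supplied via "es.query".
    es_read_conf = {
        "es.nodes": "127.0.0.1",
        "es.port": "9200",
        "es.resource": "myIndex_*",
        "es.query": """{
            "query": {
                "prefix": { "_type": { "value": "test" } }
            }
        }"""
    }
    # Pass this conf to sc.newAPIHadoopRDD exactly as in the question.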

Upvotes: 2
