Gary Wang

Reputation: 341

es.read.source.filter v.s. es.read.field.include when reading data with elasticsearch-hadoop

When reading data from Elasticsearch with elasticsearch-hadoop, there are two options to specify how to read a subset of fields from the source, according to the official documentation: es.read.source.filter and es.read.field.include.

Both can be set as a comma-separated string. I have tested both options and found that only es.read.field.include works as expected; es.read.source.filter has no effect (all fields are returned).

Question: what is the difference between these two options, and why does es.read.source.filter not take effect? Here is the code used to read the data:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('TestSparkElasticsearch').getOrCreate()
es_options = {"nodes": "node1,node2,node3", "port": 9200}
df = spark.read\
    .format("org.elasticsearch.spark.sql")\
    .options(**es_options)\
    .option("es.read.source.filter", 'sip,dip,sport,dport,protocol,tcp_flag')\
    .load("test_flow_tcp")
df.printSchema()  # returns all the fields; the option does not work

df = spark.read\
    .format("org.elasticsearch.spark.sql")\
    .options(**es_options)\
    .option("es.read.field.include", 'sip,dip,sport,dport,protocol,tcp_flag')\
    .load("test_flow_tcp")
df.printSchema()  # returns only the specified fields, as expected

Update: es.read.source.filter was added in Elasticsearch 5.4, and I am using version 5.1, so the option is silently ignored. Nevertheless, the difference between these two options in newer versions is still not clearly explained in the documentation.
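As I understand the documentation, the practical difference in newer versions is roughly this: es.read.source.filter is forwarded to Elasticsearch as source filtering on the query itself (like the _source request parameter), so documents are trimmed server-side, while es.read.field.include/exclude is applied by the connector when parsing results and computing the Spark schema. A minimal sketch of what such include-list filtering does to a single document (the function name and sample document are hypothetical, for illustration only):

```python
# Hypothetical illustration: given a comma-separated include list (the same
# string format accepted by es.read.source.filter / es.read.field.include),
# keep only the named fields of a document.

def filter_source(doc, includes):
    """Return a copy of `doc` containing only the fields named in `includes`."""
    keep = {field.strip() for field in includes.split(",")}
    return {k: v for k, v in doc.items() if k in keep}

doc = {
    "sip": "10.0.0.1", "dip": "10.0.0.2",
    "sport": 443, "dport": 51234,
    "protocol": "tcp", "tcp_flag": "ACK",
    "payload_len": 1400,  # extra field that should be dropped
}
trimmed = filter_source(doc, "sip,dip,sport,dport,protocol,tcp_flag")
```

With source filtering, this trimming happens inside Elasticsearch before the hits reach the connector; with field include, the full _source is fetched and the connector drops the extra fields itself.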

Upvotes: 1

Views: 562

Answers (0)

Related Questions