Reputation: 341
When reading data from Elasticsearch with elasticsearch-hadoop, there are two options to specify how to read a subset of fields from the source, according to the official documentation, i.e.:

es.read.field.include: Fields/properties that are parsed and considered when reading the documents from Elasticsearch...

es.read.source.filter: ...this property allows you to specify a comma delimited string of field names that you would like to return from Elasticsearch.

Both can be set as a comma-separated string.
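From what I have gathered so far, es.read.source.filter is translated by the connector into Elasticsearch _source filtering (the server itself trims each hit's _source), whereas es.read.field.include is applied on the connector side when the returned documents are parsed. A minimal sketch of the search body the source filter should correspond to (field names taken from my index; this is an illustration, not the connector's actual code):

```python
# Illustration only: to my understanding, es.read.source.filter asks
# Elasticsearch itself to trim each hit's _source, while
# es.read.field.include is applied by the connector when parsing.
fields = "sip,dip,sport,dport,protocol,tcp_flag"
search_body = {
    "query": {"match_all": {}},
    "_source": fields.split(","),  # server-side _source filtering
}
print(search_body["_source"])
```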
I have tested both options and found that only es.read.field.include works as expected, while es.read.source.filter takes no effect (all data fields are returned).

Question: what is the difference between these two options, and why does es.read.source.filter not take effect?

Here is the code to read the data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TestSparkElasticsearch').getOrCreate()
es_options = {"nodes": "node1,node2,node3", "port": 9200}

df = spark.read \
    .format("org.elasticsearch.spark.sql") \
    .options(**es_options) \
    .option("es.read.source.filter", 'sip,dip,sport,dport,protocol,tcp_flag') \
    .load("test_flow_tcp")
df.printSchema()  # returns all fields; the option does not work

df = spark.read \
    .format("org.elasticsearch.spark.sql") \
    .options(**es_options) \
    .option("es.read.field.include", 'sip,dip,sport,dport,protocol,tcp_flag') \
    .load("test_flow_tcp")
df.printSchema()  # returns only the specified fields, as expected
Updated: es.read.source.filter was added in version 5.4, and I am using version 5.1, hence the option is silently ignored. Nevertheless, the difference between these two options in newer versions is still not clearly explained in the documentation.
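Since the option was silently ignored, a quick sanity check is to compare the cluster's reported version against 5.4 before relying on es.read.source.filter. A minimal sketch, assuming the standard JSON shape of the cluster root endpoint response (GET http://&lt;node&gt;:9200):

```python
import json

# Illustrative response from the cluster root endpoint; in practice this
# would be fetched from one of the nodes, e.g. http://node1:9200.
response = json.loads('{"version": {"number": "5.1.2"}}')

major, minor = (int(x) for x in response["version"]["number"].split(".")[:2])
# es.read.source.filter only exists from 5.4 onwards, so on this cluster
# the option is silently ignored.
supports_source_filter = (major, minor) >= (5, 4)
print(supports_source_filter)  # False for 5.1.2
```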
Upvotes: 1
Views: 562