Reputation: 341
When reading data from Elasticsearch with elasticsearch-hadoop, there are two options to specify how to read a subset of fields from the source, according to the official documentation, i.e.:

es.read.field.include: Fields/properties that are parsed and considered when reading the documents from Elasticsearch...

es.read.source.filter: ...this property allows you to specify a comma delimited string of field names that you would like to return from Elasticsearch.

Both can be set as a comma-separated string.
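From what I have gathered so far, es.read.source.filter is translated by the connector into Elasticsearch _source filtering (the server itself trims each hit's _source), whereas es.read.field.include is applied on the connector side when the returned documents are parsed. A minimal sketch of the search body the source filter should correspond to (field names taken from my index; this is an illustration, not the connector's actual code):

```python
# Illustration only: to my understanding, es.read.source.filter asks
# Elasticsearch itself to trim each hit's _source, while
# es.read.field.include is applied by the connector when parsing.
fields = "sip,dip,sport,dport,protocol,tcp_flag"
search_body = {
    "query": {"match_all": {}},
    "_source": fields.split(","),  # server-side _source filtering
}
print(search_body["_source"])
```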
I have tested both options and found that only es.read.field.include works as expected, while es.read.source.filter takes no effect (all data fields are returned).

Question: what is the difference between these two options, and why does es.read.source.filter not take effect?

Here is the code to read the data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TestSparkElasticsearch').getOrCreate()
es_options = {"nodes": "node1,node2,node3", "port": 9200}

df = spark.read \
    .format("org.elasticsearch.spark.sql") \
    .options(**es_options) \
    .option("es.read.source.filter", 'sip,dip,sport,dport,protocol,tcp_flag') \
    .load("test_flow_tcp")
df.printSchema()  # returns all fields; the option does not work

df = spark.read \
    .format("org.elasticsearch.spark.sql") \
    .options(**es_options) \
    .option("es.read.field.include", 'sip,dip,sport,dport,protocol,tcp_flag') \
    .load("test_flow_tcp")
df.printSchema()  # returns only the specified fields, as expected
Updated: es.read.source.filter was added in version 5.4, and I am using version 5.1, hence the option is silently ignored. Nevertheless, the difference between these two options in newer versions is still not clearly explained in the documentation.
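Since the option was silently ignored, a quick sanity check is to compare the cluster's reported version against 5.4 before relying on es.read.source.filter. A minimal sketch, assuming the standard JSON shape of the cluster root endpoint response (GET http://&lt;node&gt;:9200):

```python
import json

# Illustrative response from the cluster root endpoint; in practice this
# would be fetched from one of the nodes, e.g. http://node1:9200.
response = json.loads('{"version": {"number": "5.1.2"}}')

major, minor = (int(x) for x in response["version"]["number"].split(".")[:2])
# es.read.source.filter only exists from 5.4 onwards, so on this cluster
# the option is silently ignored.
supports_source_filter = (major, minor) >= (5, 4)
print(supports_source_filter)  # False for 5.1.2
```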
Upvotes: 1
Views: 562