Reputation: 484
I have a problem with reading data from Elasticsearch into Spark cluster (I'm using Zeppelin environment, so all connection settings are configured in the Zeppelin interpreter settings).
First, I have tried to read it with PySpark:
%pyspark
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
df = spark.read.format("org.elasticsearch.spark.sql").load("index")
df = df.limit(100).drop('tags').drop('a.b')
# if 'tags' field is not dropped, pyspark cannot map scala field and throws an exception.
# If the limit is not set, pyspark will probably try to get the whole index at once
# if "a.b" is not dropped, the dot in the field name causes mapping error: https://github.com/elastic/elasticsearch-hadoop/issues/853
df = df.cache()
z.show(df)
Unfortunately, in this case I face many mapping issues. Cause I have a lot of fields containing dots in the dataset, I decided to give Scala a try to read the data (in order to process it in PySpark later):
%spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark
import org.elasticsearch.spark.sql
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoder
val conf = new SparkConf()
conf.set("spark.es.mapping.date.rich", "false");
conf.set("spark.serializer", classOf[KryoSerializer].getName)
val EsReadRDD = sc.esRDD("index")
However, even with Scala I can only retrieve small numbers of records, like
EsReadRDD.take(10).foreach(println)
For some reason, collect() does not work:
val esdf = EsReadRDD.collect() //does not work probably because data are too large
The error is:
Job aborted due to stage failure: Task 0 in stage 833.0 failed 4 times, most recent failure: Lost task 0.3 in stage 833.0 (TID 479, 10.10.11.37, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I have also tried conversion to DF, but get an error:
val esdf = EsReadRDD.toDF()
java.lang.UnsupportedOperationException: No Encoder found for scala.AnyRef
- map value class: "java.lang.Object"
- field (class: "scala.collection.Map", name: "_2")
- root class: "scala.Tuple2"
Do you have any idea on how to deal with it?
Upvotes: 0
Views: 2410