Andrey Sapegin

Reputation: 484

How to read 1M records from Elasticsearch into PySpark?

I have a problem with reading data from Elasticsearch into Spark cluster (I'm using Zeppelin environment, so all connection settings are configured in the Zeppelin interpreter settings).
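(For context, the connection options in the interpreter settings are the usual elasticsearch-hadoop ones; the names below come from the connector docs, the values are placeholders for my cluster:)

spark.es.nodes           es-node-1,es-node-2
spark.es.port            9200
spark.es.nodes.wan.only  true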

First, I have tried to read it with PySpark:

%pyspark
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

df = spark.read.format("org.elasticsearch.spark.sql").load("index")
df = df.limit(100).drop('tags').drop('a.b')
# if the 'tags' field is not dropped, PySpark cannot map the Scala-side field and throws an exception.
# If the limit is not set, PySpark will probably try to fetch the whole index at once.
# If 'a.b' is not dropped, the dot in the field name causes a mapping error: https://github.com/elastic/elasticsearch-hadoop/issues/853

df = df.cache()
z.show(df)
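(Side note: judging by the elasticsearch-hadoop configuration docs, the problematic fields can probably be excluded already at read time via es.read.field.exclude, instead of being dropped afterwards; a rough, untested sketch:)

%pyspark
# Sketch (untested): let the connector skip the problematic fields so they
# never reach Spark. "es.read.field.exclude" is an elasticsearch-hadoop read
# option; "tags" and "a.b" are the field names from my index.
df = spark.read.format("org.elasticsearch.spark.sql") \
    .option("es.read.field.exclude", "tags,a.b") \
    .load("index")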

Unfortunately, even with those drops I faced many mapping issues. Because a lot of fields in the dataset contain dots, I decided to give Scala a try for reading the data (in order to process it in PySpark later):

%spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark
import org.elasticsearch.spark.sql
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoder

val conf = new SparkConf()

conf.set("spark.es.mapping.date.rich", "false");
conf.set("spark.serializer", classOf[KryoSerializer].getName)

val EsReadRDD = sc.esRDD("index")

However, even with Scala I can only retrieve a small number of records, e.g.:

EsReadRDD.take(10).foreach(println)

For some reason, collect() does not work:

val esdf = EsReadRDD.collect() // does not work, probably because the data is too large

The error is:

Job aborted due to stage failure: Task 0 in stage 833.0 failed 4 times, most recent failure: Lost task 0.3 in stage 833.0 (TID 479, 10.10.11.37, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
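If I understand collect() correctly, it ships every record to the driver JVM, so with about a million documents it is bound to exceed the memory limits, which would match the lost-executor message above. Since the plan is to process the data in PySpark anyway, maybe the work should simply stay distributed; a sketch, reusing the df from the read-time-exclude sketch above:

%pyspark
# Sketch: keep the work on the executors instead of collecting to the driver.
print(df.count())     # aggregation runs on the cluster; only a number comes back
df.limit(10).show()   # only a small sample crosses over to the driver
# persisting the full result also avoids collect(); the path is a placeholder
df.write.mode("overwrite").parquet("/tmp/es_snapshot")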

I have also tried converting the RDD to a DataFrame, but I get an error:

val esdf = EsReadRDD.toDF()

java.lang.UnsupportedOperationException: No Encoder found for scala.AnyRef
- map value class: "java.lang.Object"
- field (class: "scala.collection.Map", name: "_2")
- root class: "scala.Tuple2"
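If I read the stack trace correctly, esRDD yields (id, Map[String, AnyRef]) pairs, and Spark has no Encoder for AnyRef, so toDF() cannot derive a schema. The DataFrame reader builds the schema from the index mapping instead, so going back to it with the right read options might be the cleaner route; one more sketch, assuming es.read.field.as.array.include (from the connector docs) is what the unmappable tags field needs:

%pyspark
# Sketch: the DataFrame reader derives the schema from the ES mapping, so no
# Encoder is involved. "es.read.field.as.array.include" declares fields that
# hold arrays (my guess at what broke the "tags" mapping earlier); the dotted
# field name still has to be excluded.
df = spark.read.format("org.elasticsearch.spark.sql") \
    .option("es.read.field.as.array.include", "tags") \
    .option("es.read.field.exclude", "a.b") \
    .load("index")
df.printSchema()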

Do you have any idea on how to deal with it?

Upvotes: 0

Views: 2410

Answers (0)
