Spark with Kafka streaming save to Elastic search slow performance

Question

I have a list of data, the value is basically a bson document (think json), each json ranges from 5k to 20k in size. It either can be in bson object format or can be converted to json directly:

Key, Value
--------
K1, JSON1
K1, JSON2
K2, JSON3
K2, JSON4

I expect the groupByKey would produce:

K1, (JSON1, JSON2)
K2, (JSON3, JSON4)

so that when I do:

val data = [...].map(x => (x.Key, x.Value))
val groupedData = data.groupByKey()
groupedData.foreachRDD { rdd =>
   //the elements in the rdd here are not really grouped by the Key
}

I am so confused the the behaviour of the RDD. I read many articles in the internet including the official website from Spark: https://spark.apache.org/docs/0.9.1/scala-programming-guide.html

Still couldn't achieve what I want.

-------- UPDATED ---------------------

Basically I really need it to be grouped by the key, the key is the index to be used in Elasticsearch, so that I can perform batch process based on the key via Elasticsearch for Hadoop:

EsSpark.saveToEs(rdd);

I can't do per partition because Elasticsearch only accept RDD. I tried to use sc.MakeRDD or sc.parallize, both telling me it is not serializable.

I tried to use:

EsSpark.saveToEs(rdd, Map(
          "es.resource.write" -> "{TheKeyFromTheObjectAbove}",
          "es.batch.size.bytes" -> "5000000")

Documentation of the config is here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html

But it is VERY slow comparing to not using the configuration to define dynamic index based on the value of individual document, I suspect it is parsing every json to fetch the value dynamically.

Spark with Kafka streaming save to Elastic search slow performance

Answers (1)

Related Questions