Reputation: 1291
I am trying to use Apache Spark to create an index in Elasticsearch (writing huge data to ES). I have written a Scala program to create the index using Apache Spark. I have to index huge data, which I get as product beans in a LinkedList. Then I tried to traverse over the product bean list and create the index. My code is given below.
import java.util
import scala.collection.JavaConversions._
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf().setAppName("ESIndex").setMaster("local[*]")
conf.set("es.index.auto.create", "true")
  .set("es.nodes", "127.0.0.1")
  .set("es.port", "9200")
  .set("es.http.timeout", "5m")
  .set("es.scroll.size", "100")
val sc = new SparkContext(conf)

// getData() returns my product beans in a LinkedList.
val list: util.LinkedList[product] = getData()

// Index one bean at a time: a new single-element RDD per item.
for (item <- list) {
  sc.makeRDD(Seq(item)).saveToEs("my_core/json")
}
The issue with this approach is that it takes too much time to create the index. Is there a way to create the index more efficiently?
Upvotes: 1
Views: 926
Reputation: 330063
Don't pass data through the driver unless it is necessary. Depending on the source of the data returned by getData, you should use the relevant input method or create your own. If the data comes from MongoDB, use for example mongo-hadoop, Spark-MongoDB, or Drill with a JDBC connection. Then use map or a similar method to build the required objects and call saveToEs on the transformed RDD.
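For illustration, a minimal sketch of that pattern with mongo-hadoop, assuming a local MongoDB holding a mydb.products collection with a name field (the URI, collection, field names, and mapping are placeholders to adapt, not part of the question):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._
import com.mongodb.hadoop.MongoInputFormat
import org.bson.BSONObject

val conf = new SparkConf().setAppName("ESIndex").setMaster("local[*]")
  .set("es.index.auto.create", "true")
  .set("es.nodes", "127.0.0.1")
  .set("es.port", "9200")
val sc = new SparkContext(conf)

// Point the input format at the source collection; the RDD is built
// directly on the executors, so nothing is funneled through the driver.
val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://127.0.0.1:27017/mydb.products")

val docs = sc.newAPIHadoopRDD(
  mongoConf,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

// Build the objects to index with map, then write the whole RDD at once.
docs
  .map { case (_, doc) => Map("id" -> doc.get("_id").toString, "name" -> doc.get("name")) }
  .saveToEs("my_core/json")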
Creating an RDD with a single element doesn't make sense. It doesn't benefit from the Spark architecture at all: you just start a potentially huge number of tasks which have nothing to do, with only a single active executor at a time.
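Even if you keep getData() as-is, the loop from the question can at least be collapsed into a single distributed write; a sketch reusing the question's product bean and SparkContext:

import scala.collection.JavaConverters._
import org.elasticsearch.spark._

// One RDD over the whole list: its partitions are indexed in parallel,
// instead of a tiny single-element job per bean. The list is still
// collected on the driver first, so prefer a direct input source if possible.
val products: Seq[product] = getData().asScala.toSeq
sc.makeRDD(products).saveToEs("my_core/json")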
Upvotes: 3