Reputation: 75
I am new to Apache Spark and have been investigating its use for a project. One of the project's bottlenecks is that I must iteratively fit a large number of different models (or the same model with variations on the hyperparameters) to a single small dataset, thousands of times. Each dataset has somewhere in the neighborhood of 100-1000 rows and about 10,000-100,000 columns, so a single fit is easily handled on one machine (or worker node). The nice thing is that these tasks can be performed in isolation from one another, so they are embarrassingly parallelizable.
Given the potentially limited gains from splitting such a small dataset into an RDD spread across a large number of nodes, it seems it would be more efficient to have many worker nodes operating independently of one another than to have them operate in unison on each fit. One option would be to cache the same dataset in memory on each of my worker nodes and have them work independently through all the fits that must be performed. Tools like HTCondor + Cloud Scheduler or StarCluster seem able to handle this kind of task very well, but Apache Spark has a lot to offer in other areas, so I am interested in using it for the project as a whole.
What I can't seem to find a good answer to is whether Apache Spark has the functionality to tackle a problem like this. This question seems to address a similar topic, but with my currently limited knowledge of Spark I can't tell, from looking at the documentation of the library it mentions, whether it would help with this problem. Knowing up front whether this is the right tool for the job will save me the wasted time of learning a tool that I later realize cannot accomplish what I need.
Upvotes: 0
Views: 133
Reputation: 7452
This is pretty well supported, but it's certainly not the standard use case people talk about. If your data is really small (and it sounds like it is), you can use the broadcast function to make it available to all the workers, and then construct an RDD of the different parameter settings you want to try when fitting your models. You can then map over those parameters and train your models on multiple nodes in your cluster.
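A rough sketch of that pattern in PySpark, using scikit-learn's Ridge purely as a stand-in for whatever model you are actually fitting; the dataset, parameter grid, and returned score are all made-up placeholders:

```python
import numpy as np
from sklearn.linear_model import Ridge
from pyspark import SparkContext

sc = SparkContext(appName="parallel-model-fits")

# Placeholder dataset; yours would be loaded once on the driver.
# It is small enough to fit comfortably in memory on every worker.
X = np.random.rand(500, 1000)
y = np.random.rand(500)

# Ship one read-only copy of the data to each worker.
data_bc = sc.broadcast((X, y))

# The hyperparameter settings to try, one per task.
params = [{"alpha": a} for a in np.logspace(-3, 3, 200)]

def fit_one(p):
    X_local, y_local = data_bc.value
    model = Ridge(**p).fit(X_local, y_local)
    # Return whatever summary you need; fitted models can be large,
    # so a score (or the coefficients you care about) is often enough.
    return p, model.score(X_local, y_local)

# Distribute the parameter list and fit each setting independently.
results = (sc.parallelize(params, numSlices=len(params))
             .map(fit_one)
             .collect())
```

Each fit runs as its own task, so the cluster works through the parameter list in parallel while every worker reads the same broadcast copy of the data.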
Upvotes: 1