SPadi

Reputation: 11

Parallel, distributed processing with SparkR 2.0.0

I am new to Spark and am trying SparkR 2.0.0 on an R Server that is a Hadoop edge node. Creating and querying DataFrames works fine, but there is a problem I am trying to figure out how to handle.

Given an item, I need to query external data sources for related data, learn from it with some ML library call, and dump the results. I need to do this learning for about 500 items. Obviously I want to use all the CPUs on all available worker nodes so that the 500 ML runs can happen in parallel. I noticed that a native ML call in open source R does not take much time to run the algorithm on an item's data set (typically ~10,000 rows; about 1 minute in total to fetch the data, run the ML, and deliver the results I need).

Note that I am not invoking Spark's ML. I was trying to see whether I could use Spark only for distributed, parallel computation and how fast I could learn that way. The alternative is to load all 500 items into a Spark DataFrame and leave it to Spark to figure out how to run ML on the partitioned DataFrame. But that is a separate effort, and a separate study, to compare how it performs against multiple, parallel, distributed runs of mini MLs (one per item). A rough sketch of what I imagine that alternative would look like is below.
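This is only my guess at the partitioned-DataFrame approach, assuming SparkR's gapply behaves the way I think it does; the lm() fit and the column names are just stand-ins for my real data and model:

    library(SparkR)
    sparkR.session()

    # Stand-in for the real data: one data.frame with an item_id column plus features.
    # (Only 5 items here to keep the sketch small; the real case would have ~500.)
    local_df <- data.frame(item_id = rep(1:5, each = 100),
                           y  = rnorm(500),
                           x1 = rnorm(500))
    sdf <- createDataFrame(local_df)

    # Schema of what each per-item model run returns
    schema <- structType(structField("item_id", "integer"),
                         structField("intercept", "double"),
                         structField("slope", "double"))

    # gapply groups the SparkDataFrame by item_id and runs the R function on each
    # group (delivered to the executors as an ordinary R data.frame)
    fits <- gapply(sdf, "item_id",
                   function(key, x) {
                     fit <- lm(y ~ x1, data = x)
                     data.frame(item_id   = key[[1]],
                                intercept = unname(coef(fit)[1]),
                                slope     = unname(coef(fit)[2]))
                   },
                   schema)

    head(collect(fits))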

Question: How do I parallelize in SparkR? Do I have to use callJMethod, passing a SparkDataFrame of items and invoking the function call for each item? Or is there a better way to parallelize my collection of items and make a function call on each (like a parallel dapply)? Any tips/help appreciated.

Sorry for the long post. I am fairly new to Spark, and there seem to be Scala/Java/R and Python ways of doing things; maybe the R approach is more limited than the others in ways I have not caught up on yet. Thanks!

Upvotes: 1

Views: 568

Answers (1)

user2280549

Reputation: 1234

Did you try the spark.lapply function? Basically, it uses Spark as a resource provider, not as a tool for "big data" processing. If the data for your 500 items can be handled in memory, you could create a list with 500 elements (each element holding the corresponding data plus anything else it needs, such as hyperparameters) and pass it to spark.lapply together with an appropriate function (e.g. one that fits some machine learning model). What Spark does is similar to the parallel package: it opens separate R sessions on the worker nodes, distributes your computations, and returns the results to the driver.
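A minimal sketch of what that could look like (the simulated data and the lm() call are just placeholders for your own per-item data and ML routine):

    library(SparkR)

    # Connect to the cluster; assumes the Spark/YARN configuration is already in place
    sparkR.session(appName = "parallel-ml-sketch")

    # One list element per item: its data plus whatever else the model run needs
    items <- lapply(1:500, function(i) {
      list(id = i,
           data = data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100)))
    })

    # One "mini ML" run per item -- replace lm() with whatever native R model you use
    run_ml <- function(item) {
      fit <- lm(y ~ x1 + x2, data = item$data)
      list(id = item$id, coefficients = coef(fit))
    }

    # spark.lapply ships each element (and the function) to the executors and
    # collects the results back on the driver as a plain R list
    results <- spark.lapply(items, run_ml)
    length(results)   # 500

    sparkR.session.stop()

Note that the whole list travels from the driver to the executors and the results come back to the driver, so this works best when each item's data is modest in size (as in your ~10,000-row case).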

Upvotes: 1
