SPadi

Reputation: 11

Parallel, distributed processing with SparkR 2.0.0

I am new to Spark and am trying SparkR 2.0.0 on an R Server that is a Hadoop edge node. Creating and querying DataFrames works fine, but there is a problem I am trying to figure out how to handle.

Given an item, I need to query external data sources for related data, learn from it with some ML library call, and dump the results. I need to do this learning for about 500 items. Obviously I want to use all the CPUs on all available worker nodes so that the 500 ML runs can happen in parallel. I noticed that a native ML call in open source R does not take much time to run the algorithm on an item's data set (typically ~10,000 rows; about 1 minute in total to fetch the data, run the ML, and deliver the results I need).

Note that I am not invoking Spark's ML. I was trying to see whether I could use Spark only for distributed, parallel computation and how fast I could learn that way. The alternative is to load all 500 items into a Spark DataFrame and leave it to Spark to figure out how to run ML on the partitioned DataFrame. But that is a separate effort, and a separate study, to compare how it performs against multiple, parallel, distributed runs of mini MLs (one per item). A rough sketch of what I imagine that alternative would look like is below.
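This is only my guess at the partitioned-DataFrame approach, assuming SparkR's gapply behaves the way I think it does; the lm() fit and the column names are just stand-ins for my real data and model:

    library(SparkR)
    sparkR.session()

    # Stand-in for the real data: one data.frame with an item_id column plus features.
    # (Only 5 items here to keep the sketch small; the real case would have ~500.)
    local_df <- data.frame(item_id = rep(1:5, each = 100),
                           y  = rnorm(500),
                           x1 = rnorm(500))
    sdf <- createDataFrame(local_df)

    # Schema of what each per-item model run returns
    schema <- structType(structField("item_id", "integer"),
                         structField("intercept", "double"),
                         structField("slope", "double"))

    # gapply groups the SparkDataFrame by item_id and runs the R function on each
    # group (delivered to the executors as an ordinary R data.frame)
    fits <- gapply(sdf, "item_id",
                   function(key, x) {
                     fit <- lm(y ~ x1, data = x)
                     data.frame(item_id   = key[[1]],
                                intercept = unname(coef(fit)[1]),
                                slope     = unname(coef(fit)[2]))
                   },
                   schema)

    head(collect(fits))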

Question: How do I parallelize in SparkR? Do I have to use callJMethod, passing a SparkDataFrame of items and invoking the function call for each item? Or is there a better way to parallelize my collection of items and make a function call on each (like a parallel dapply)? Any tips/help appreciated.

Sorry for the long post. I am fairly new to Spark, and there seem to be Scala/Java/R and Python ways of doing things; maybe the R approach is more limited than the others in ways I have not caught up on yet. Thanks!

Upvotes: 1

Views: 568

Answers (1)

user2280549

Reputation: 1234

Did you try the spark.lapply function? Basically, it uses Spark as a resource provider, not as a tool for "big data" processing. If the data for your 500 items can be handled in memory, you could create a list with 500 elements (each element holding the corresponding data plus anything else it needs, such as hyperparameters) and pass it to spark.lapply together with an appropriate function (e.g. one that fits some machine learning model). What Spark does is similar to the parallel package: it opens separate R sessions on the worker nodes, distributes your computations, and returns the results to the driver.
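A minimal sketch of what that could look like (the simulated data and the lm() call are just placeholders for your own per-item data and ML routine):

    library(SparkR)

    # Connect to the cluster; assumes the Spark/YARN configuration is already in place
    sparkR.session(appName = "parallel-ml-sketch")

    # One list element per item: its data plus whatever else the model run needs
    items <- lapply(1:500, function(i) {
      list(id = i,
           data = data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100)))
    })

    # One "mini ML" run per item -- replace lm() with whatever native R model you use
    run_ml <- function(item) {
      fit <- lm(y ~ x1 + x2, data = item$data)
      list(id = item$id, coefficients = coef(fit))
    }

    # spark.lapply ships each element (and the function) to the executors and
    # collects the results back on the driver as a plain R list
    results <- spark.lapply(items, run_ml)
    length(results)   # 500

    sparkR.session.stop()

Note that the whole list travels from the driver to the executors and the results come back to the driver, so this works best when each item's data is modest in size (as in your ~10,000-row case).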

Upvotes: 1
