lserlohn

Reputation: 6206

How can I apply the "by" function in parallel, preferably with SparkR?

I need to apply the "by" function to a very big data set. The data looks like:

id    value1   value2
1      245      446
1      592      567
1      356      642
...    ...
2      231      421
2      423      425
2      421      542

I need to calculate something from value1 and value2 for each id, like this:

by(dataset, dataset$id, function(input) myfun(input$value1, input$value2))
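
For example, with a small toy version of the data (myfun below is only a placeholder for my real calculation):

# toy version of the data above; myfun is just a stand-in
dataset <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2),
  value1 = c(245, 592, 356, 231, 423, 421),
  value2 = c(446, 567, 642, 421, 425, 542)
)

myfun <- function(v1, v2) mean(v1 - v2)   # placeholder for the real calculation

by(dataset, dataset$id, function(input) myfun(input$value1, input$value2))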

However, the data set is very big, and this computation takes a lot of time.

I would like to know if there is any way to speed this up.

I want to use a parallel method, preferably SparkR, but I don't know how to get this done. Can SparkR support it?

Upvotes: 0

Views: 181

Answers (1)

zero323

Reputation: 330303

Long story short, there is no supported way of translating by / tapply with an arbitrary function to SparkR. At the moment (Spark 1.5 / 1.6 preview) SparkR exposes only a limited subset of the Spark SQL API, which is more or less a distributed SQL query engine.

If the function you want to use can be expressed without R, using standard SQL logic with GROUP BY and / or window functions, then you're good to go.
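
For illustration only, here is a rough sketch of that route with the SparkR 1.5 / 1.6 API; local_df and the AVG(value1 - value2) aggregation are placeholders for your actual data and logic:

library(SparkR)

sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

# local_df is an ordinary R data frame with columns id, value1, value2
sdf <- createDataFrame(sqlContext, local_df)
registerTempTable(sdf, "dataset")

# express the per-id computation in plain SQL with GROUP BY
result <- sql(sqlContext, "
  SELECT id, AVG(value1 - value2) AS result
  FROM dataset
  GROUP BY id")

head(collect(result))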

Some people tend to use the internal RDD API, which provides operations like aggregateByKey / combineByKey, reduceByKey or groupByKey. Personally, I would strongly advise against that. This part of the API is much less mature than its Scala or Python counterparts, lacks some basic features, and is significantly slower.

While software recommendations are off-topic for SO, there are multiple R libraries you may find useful, including parallel, snow, doMC, Rmpi and the brand-new multidplyr. Add great storage options like data.table or ff and R-independent solutions like GNU Parallel, and you have plenty of options. Since the problem you're trying to solve is embarrassingly parallel, using some combination of these tools should give a much higher ROI than tinkering with SparkR internals.
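
For example, a minimal sketch with base R's parallel package (myfun and the core count are placeholders, and mclapply relies on forking, so it only parallelizes on Unix-like systems):

library(parallel)

# one data frame per id; each group can be processed independently
chunks <- split(dataset, dataset$id)

results <- mclapply(
  chunks,
  function(chunk) myfun(chunk$value1, chunk$value2),
  mc.cores = detectCores()
)

With data.table the same grouping could be written as something like dataset_dt[, myfun(value1, value2), by = id], which is often fast enough even without explicit parallelism.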

Upvotes: 1
