Reputation: 6206
I need to apply the "by" function to a very big data set. The data looks like:
id value1 value2
1 245 446
1 592 567
1 356 642
... ...
2 231 421
2 423 425
2 421 542
I need to calculate something between value1 and value2 for each id, like this:
by(dataset, dataset$id, function(input) myfun(input$value1, input$value2))
However, the data is very big, so this computation takes a lot of time.
I would like to know whether there is any way to speed up this function.
I want to use a parallel method, preferably with SparkR, but I don't know how to get this done. Can SparkR support it?
Upvotes: 0
Views: 181
Reputation: 330303
Long story short, there is no supported way of translating by / tapply with an arbitrary function to SparkR. At this moment (Spark 1.5 / 1.6 Preview) SparkR exposes only a limited subset of the Spark SQL API, which is more or less a distributed SQL query engine.
If the function you want to use can be expressed without R, using standard SQL logic with GROUP BY and / or window functions, then you're good to go.
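For illustration, a minimal sketch of that route, assuming a SparkR 1.5-style session (sparkR.init / sparkRSQL.init) and using SUM(value1 - value2) as a placeholder aggregate, since the actual myfun isn't specified:

library(SparkR)

sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# dataset is the local data.frame from the question
df <- createDataFrame(sqlContext, dataset)
registerTempTable(df, "dataset")

# The GROUP BY runs inside Spark SQL; no R function is shipped to the workers
result <- sql(sqlContext, "SELECT id, SUM(value1 - value2) AS total_diff
                           FROM dataset GROUP BY id")
head(result)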
Some people tend to use the internal RDD API, which provides operations like aggregateByKey / combineByKey, reduceByKey or groupByKey. Personally, I would strongly advise against that. This part of the API is much less mature compared to its Scala or Python counterparts, lacks some basic features, and is significantly slower.
While software recommendations are off-topic for SO, there are multiple R libraries you may find useful, including parallel, snow, doMC, Rmpi and the brand new multidplyr. Adding great storage options like data.table or ff and R-independent solutions like GNU Parallel, you have plenty of options. Since the problem you're trying to solve is embarrassingly parallel, using some combination of these tools should give a much higher ROI than tinkering with SparkR internals.
Upvotes: 1