user1441849

Reputation: 245

How to update an RDD?

We are developing a Spark framework wherein we move historical data into RDDs.

Basically, an RDD is an immutable, read-only dataset on which we perform operations. Based on that, we have moved the historical data into RDDs, and we run computations like filtering, mapping, etc. on them.

Now there is a use case where a subset of the data in the RDD gets updated, and we have to recompute the values.

HistoricalData is in the form of an RDD. I create another RDD based on the request scope and save the reference of that RDD in a ScopeCollection.

So far I have been able to think of the approaches below.

Approach 1: broadcast the change

  1. For each change request, my server fetches the scope-specific RDD and spawns a job
  2. In that job, apply a map phase on the RDD -

    2.a. for each record in the RDD, do a lookup on the broadcast and create a new value that is now updated, thereby creating a new RDD
    2.b. now I do all the computations again on this new RDD from step 2.a, like multiplication, reduction, etc.
    2.c. I save this RDD's reference back in my ScopeCollection
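
A minimal sketch of this approach, assuming the scope-specific RDD is a pair RDD of (id, value) and that the change request is small enough to broadcast as a map; all names here are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastUpdateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-update").setMaster("local[*]"))

    // hypothetical scope-specific RDD of (id, value) and a small change request
    val scopeRdd = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("c", 3.0)))
    val changes  = Map("b" -> 20.0)

    // 1. broadcast the change request
    val changesBc = sc.broadcast(changes)

    // 2.a. per record, look up the broadcast and build the updated value -> a new RDD
    val updatedRdd = scopeRdd.map { case (k, v) => (k, changesBc.value.getOrElse(k, v)) }

    // 2.b. redo the downstream computations on the new RDD
    val total = updatedRdd.values.map(_ * 2).reduce(_ + _)
    println(s"recomputed total: $total")

    // 2.c. cache it; this reference is what would go back into the ScopeCollection
    updatedRdd.cache()

    sc.stop()
  }
}
```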

Approach 2: create an RDD for the updates

  1. For each change request, my server fetches the scope-specific RDD and spawns a job
  2. On that RDD, do a join with a new RDD containing the changes
  3. now I do all the computations again on this new RDD from step 2, like multiplication, reduction, etc.
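
A similar sketch for the join-based approach, under the same hypothetical (id, value) layout:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinUpdateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-update").setMaster("local[*]"))

    // hypothetical scope-specific RDD and an RDD built from the change request
    val scopeRdd   = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("c", 3.0)))
    val updatesRdd = sc.parallelize(Seq(("b", 20.0)))

    // 2. a left outer join keeps unchanged records and overrides the matched ones
    val updatedRdd = scopeRdd.leftOuterJoin(updatesRdd).mapValues {
      case (oldValue, maybeNew) => maybeNew.getOrElse(oldValue)
    }

    // 3. recompute the downstream values on the new RDD
    val total = updatedRdd.values.map(_ * 2).reduce(_ + _)
    println(s"recomputed total: $total")

    sc.stop()
  }
}
```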

Approach 3:

I had thought of creating a streaming RDD where I keep updating the same RDD and do the re-computation. But as far as I understand, Spark Streaming can take streams from Flume or Kafka, whereas in my case the values are generated within the application itself based on user interaction. Hence I cannot see any integration points for a streaming RDD in my context.

Any suggestion on which approach is better, or any other approach suitable for this scenario?

TIA!

Upvotes: 21

Views: 6253

Answers (2)

evgenii

Reputation: 1235

I suggest taking a look at the IndexedRDD implementation, which provides an updatable RDD of key-value pairs. That might give you some insights.

The idea is based on knowing the key, which allows you to zip your updated chunk of data with the same keys in the already created RDD. During an update it is possible to filter out the previous version of the data.
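
A rough sketch of that idea using plain pair-RDD operations (not the IndexedRDD API itself), assuming each record carries a version number so older values can be filtered out; the names are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KeyedUpdateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("keyed-update").setMaster("local[*]"))

    // (eventId, (version, value)) -- the version identifies the newest record per key
    val historical = sc.parallelize(Seq(("e1", (1L, 10.0)), ("e2", (1L, 20.0))))
    val updates    = sc.parallelize(Seq(("e2", (2L, 25.0))))

    // combine the updated chunk with the existing data and drop the older versions per key
    val current = historical.union(updates)
      .reduceByKey((a, b) => if (a._1 >= b._1) a else b)

    current.collect().foreach(println)
    sc.stop()
  }
}
```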

Since you have historical data, I'd say you need some sort of identity for each event.

Regarding streaming and consumption, it's possible to use a TCP port. That way the driver can open a TCP connection that Spark expects to read from, and send the updates there.
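
For example, a minimal sketch of the Spark side reading such updates from a local port; the port number and the "id,value" line format are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketUpdatesSketch {
  def main(args: Array[String]): Unit = {
    // at least two local threads: one for the receiver, one for processing
    val conf = new SparkConf().setAppName("socket-updates").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // one update per line, e.g. "e2,25.0", written by the application to the port
    val updates = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(parts => (parts(0), parts(1).toDouble))

    updates.print() // replace with the actual re-computation

    ssc.start()
    ssc.awaitTermination()
  }
}
```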

Upvotes: 1

maasg

Reputation: 37435

The use case presented here is a good match for Spark Streaming. The other two options raise the question: "How do you submit a re-computation of the RDD?"

Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data, and to preserve that data in RDD form. Kafka and Flume are only two of the possible stream sources.

You could use socket communication with the SocketInputDStream, read files in a directory using the FileInputDStream, or even use a shared queue with the QueueInputDStream. If none of those options fits your application, you could write your own InputDStream.
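
For instance, here is a rough sketch of the queue route, which may fit your case where updates are produced inside the application itself; the names and the (id, value) element type are assumptions:

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueUpdatesSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("queue-updates").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // the application pushes small RDDs of (id, value) changes into this shared queue
    val updateQueue  = new mutable.Queue[RDD[(String, Double)]]()
    val updateStream = ssc.queueStream(updateQueue)

    updateStream.print() // hook the real re-computation in here

    ssc.start()

    // elsewhere in the application, whenever a user interaction produces changes:
    updateQueue += ssc.sparkContext.parallelize(Seq(("e2", 25.0)))

    ssc.awaitTermination()
  }
}
```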

In this use case, using Spark Streaming, you would read your base RDD and use the incoming DStream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform allows you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation could help you build an in-memory state addressed by keys. See the documentation for further information.
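
A minimal sketch of that combination, assuming (id, value) pairs and a socket source; the checkpoint directory, port, and the "keep the latest value" update function are only placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EvolvingStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("evolving-state").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/spark-checkpoint") // required by updateStateByKey

    // the base RDD holding the historical (id, value) data
    val baseRdd = ssc.sparkContext.parallelize(Seq(("e1", 10.0), ("e2", 20.0)))

    // incoming updates, one "id,value" line per change
    val updates = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(parts => (parts(0), parts(1).toDouble))

    // transform: combine each batch of updates with the base RDD
    val joined = updates.transform(batchRdd => batchRdd.leftOuterJoin(baseRdd))

    // updateStateByKey: keep the latest value per key as the evolving in-memory state
    val state = joined.updateStateByKey[Double] {
      (batch: Seq[(Double, Option[Double])], previous: Option[Double]) =>
        batch.lastOption.map(_._1).orElse(previous)
    }

    state.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```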

Without more details on the application, it is hard to go down to the code level on what's possible using Spark Streaming. I'd suggest you explore this path and ask new questions about any specific topics.

Upvotes: 9
