Reputation: 4230
I have a use case where I want to process a large number of events. Each event contains multiple attributes. However, I want to ensure that for a given attribute (key), no more than one Spark execution runs at a given time, because if two executions run in parallel for the same key, the end result is determined by a race condition.
My model is something like:
Is Apache Storm a better contender for such a system?
Upvotes: 4
Views: 805
Reputation: 3629
Amazon Kinesis uses shards as data containers within a stream, and within a shard, records are guaranteed to be processed sequentially.
You can exploit this feature for your use case: use predefined "Partition Key" values when putting records into the stream.
For example, if you are dealing with user events, you can use the user's id as the partition key on the producer side.
That way, you can be sure that the events for a single user are processed in order, while you still get parallelism across different users' events (i.e. Kinesis records).
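As a minimal sketch with boto3, that could look like the following (the stream name event-stream and the user_id field are assumptions for illustration, not from your post):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def put_event(event):
    # Using the user's id as the partition key routes all events for
    # that user to the same shard, so Kinesis preserves their order.
    kinesis.put_record(
        StreamName="event-stream",           # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),  # hypothetical key field
    )

put_event({"user_id": 42, "action": "click"})
```

Events with different partition keys can land on different shards and be consumed in parallel, which gives you the per-key serialization you described without giving up overall throughput.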
Upvotes: 3
Reputation: 85
You can use just one partition and thereby eliminate parallelism.
Also, in my opinion, Apache Kafka is a better choice for a scenario like this.
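As a minimal sketch, assuming kafka-python and a local broker, a single-partition topic could be created like this (the topic name events is hypothetical):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# A topic with a single partition is consumed strictly in order,
# which removes parallelism (and its race conditions) entirely.
admin.create_topics([
    NewTopic(name="events", num_partitions=1, replication_factor=1)
])
```

Note that, as with Kinesis, keying messages by the attribute would preserve per-key order while keeping parallelism across keys, which fits the question better than a single partition.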
Upvotes: -2