Adi

Reputation: 4230

How to ensure ordered processing of events using Spark Streaming?

I have a use case where I want to process a large number of events. These events contain multiple attributes. However, I want to ensure that for a given attribute (key), no more than one Spark execution is running at a given time, because if two executions run in parallel for the same key, the end result will be determined by a race condition.

My model is something like:

Is Apache Storm a better contender for such a system?

Upvotes: 4

Views: 805

Answers (2)

az3

Reputation: 3629

Amazon Kinesis uses shards as the data containers within a stream, and within a single shard, records are guaranteed to be processed sequentially.

You can exploit this feature for your use case: use a predefined "Partition Key" value when putting records into the stream.

For example, if you are dealing with user events, you can use the user's ID as the partition key on the producer side.

  • User #1: first makes a purchase, then updates a score, after that browses to page X, etc.
  • User #2: first does X, then does Y, after that event Z occurs, etc.

That way, you can be sure that the events for a single user are processed in order, and you still get parallelism across different users' events (i.e. Kinesis records).
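As a minimal sketch of the producer side with boto3 (the stream name, region, and event structure here are assumptions, not part of the question): every event for the same user carries the same `PartitionKey`, so Kinesis routes it to the same shard and it is read back in order.

```python
import json
import boto3

# Hypothetical stream; events sharing a PartitionKey go to the same shard,
# so Kinesis preserves their relative order.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_user_event(user_id, event):
    kinesis.put_record(
        StreamName="user-events",                  # assumed stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(user_id),                 # same key -> same shard -> ordered
    )

# User #1's events land in one shard, in order; User #2's may land in another.
put_user_event(1, {"type": "purchase", "item": "book"})
put_user_event(1, {"type": "score_update", "score": 42})
put_user_event(2, {"type": "page_view", "page": "X"})
```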

Upvotes: 3

user7005835

Reputation: 85

You can use just one partition and thereby stop parallelism.

Also, in my opinion, Apache Kafka is a better choice for a scenario like this.
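As a rough sketch of this idea (the topic name and broker address are assumptions): producing to a Kafka topic that was created with a single partition gives a total order of events at the cost of parallelism.

```python
import json
from kafka import KafkaProducer

# Hypothetical topic "events", assumed to have been created with a single
# partition, e.g. kafka-topics.sh --create --topic events --partitions 1 ...
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# With one partition, all records are appended and consumed in send order,
# but there is no parallelism across consumers of this topic.
producer.send("events", {"user": 1, "type": "purchase"})
producer.send("events", {"user": 1, "type": "score_update"})
producer.flush()
```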

Upvotes: -2
