Reputation: 4230
I have a use case where I want to process a large number of events. Each event contains multiple attributes. However, I want to ensure that for a given attribute (key), no more than one Spark execution runs at a given time, because if two executions run in parallel for the same key, the end result is determined by a race condition.
My model is something like:
Is Apache Storm a better contender for such a system?
Upvotes: 4
Views: 805
Reputation: 3629
Amazon Kinesis uses shards as data containers within a stream, and within a shard, records are guaranteed to be processed sequentially.
You can exploit this feature for your use case: use predefined "Partition Key" values when putting records into the stream.
For example, if you are dealing with user events, you can use the user's id as the partition key on the producer side.
That way, you can be sure that the events for a single user are processed in order, while you still get parallelism across different users' events (i.e. Kinesis records).
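As a minimal sketch with boto3, that could look like the following (the stream name event-stream and the user_id field are assumptions for illustration, not from your post):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def put_event(event):
    # Using the user's id as the partition key routes all events for
    # that user to the same shard, so Kinesis preserves their order.
    kinesis.put_record(
        StreamName="event-stream",           # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),  # hypothetical key field
    )

put_event({"user_id": 42, "action": "click"})
```

Events with different partition keys can land on different shards and be consumed in parallel, which gives you the per-key serialization you described without giving up overall throughput.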
Upvotes: 3
Reputation: 85
You can use just one partition and thereby eliminate parallelism.
Also, in my opinion, Apache Kafka is a better choice for a scenario like this.
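As a minimal sketch, assuming kafka-python and a local broker, a single-partition topic could be created like this (the topic name events is hypothetical):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# A topic with a single partition is consumed strictly in order,
# which removes parallelism (and its race conditions) entirely.
admin.create_topics([
    NewTopic(name="events", num_partitions=1, replication_factor=1)
])
```

Note that, as with Kinesis, keying messages by the attribute would preserve per-key order while keeping parallelism across keys, which fits the question better than a single partition.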
Upvotes: -2