Fisher Coder

Reputation: 3576

Spark or a traditional daemon to process stream updates?

This is a question to solicit ideas about implementation options.

We are migrating a system which currently uses Spark Streaming. In designing the new system, we are debating the two implementation options:

  1. continue to use Spark Streaming
  2. use a regular daemon process instead

Our use case: we have a data store that constantly produces updates that we need to consume. The volume and frequency of our data will only grow.

I have access to our current Spark job's web UI; please let me know if any metrics or data would help support either option.

Thanks!

Upvotes: 0

Views: 82

Answers (1)

Bartosz Konieczny

Reputation: 2033

Thanks for the comment.

If you only need to capture the data and move it elsewhere, the daemon-based solution may work. Still, your data source must make it easy to add new consumers, as Apache Kafka does with consumer groups. In that case you can simply deploy a new container wherever you want (Kubernetes, Mesos, ECS, ...) and let your source distribute the workload across the consumers. Seems fine.
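
To make the daemon option concrete, here is a minimal sketch of such a consumer in Scala, assuming Kafka as the source; the broker address, topic name, and group id are placeholders, and the `println` stands in for whatever "move it elsewhere" means in your system. Because every instance joins the same consumer group, Kafka rebalances partitions across however many copies of the daemon you deploy.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

object UpdatesDaemon {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker address
    props.put("group.id", "updates-daemon")            // all instances share this group, so Kafka splits partitions among them
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("updates"))  // placeholder topic name

    // Capture-and-forward loop: poll, push each record to the target system, repeat.
    while (true) {
      val records = consumer.poll(Duration.ofSeconds(1))
      records.asScala.foreach { record =>
        println(s"${record.key} -> ${record.value}")    // replace with your own forwarding logic
      }
    }
  }
}
```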

But if you want to do more complex things like stateful aggregates or grouped operations, it will be hard to reimplement everything from scratch and then maintain it afterwards. And IMO, even if you know right now that you won't need that, nothing guarantees it will stay true forever. In addition, you will have to adapt your custom consumer to every new release of the data source, whereas with an open-source solution most of that work is handled by the community.
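
For comparison, that kind of stateful, grouped work is only a few lines in Spark Structured Streaming. This is just a sketch under the assumption that the updates arrive through Kafka; the broker address, topic name, and window sizes are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StatefulAggExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("updates-agg").getOrCreate()
    import spark.implicits._

    // Read the update stream from Kafka (broker address and topic name are placeholders).
    val updates = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "updates")
      .load()
      .selectExpr("CAST(key AS STRING) AS entity", "timestamp")

    // Stateful, grouped aggregation: count updates per entity in 5-minute windows.
    // Spark keeps the window state and handles late data through the watermark.
    val counts = updates
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"entity")
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```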

If your concern is scaling, Apache Spark scales according to the partition distribution of the underlying data store. So if you add new partitions to your Kafka topic, Apache Spark should scale accordingly - I agree that doing this automatically is not a piece of cake, but here you focus on only one problem (auto-scaling), whereas in the previous option auto-scaling is one of many points you have to implement yourself.
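
As an illustration of that single remaining problem: partitions can be added to the topic with Kafka's admin API, and the Structured Streaming Kafka source discovers new partitions on its own, so the running job simply schedules more tasks. The broker address, topic name, and target partition count below are placeholders.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, NewPartitions}

object AddPartitions {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker address

    val admin = AdminClient.create(props)
    // Grow the "updates" topic to 12 partitions; a Spark job reading this topic
    // can then run more parallel tasks on subsequent micro-batches without code changes.
    admin.createPartitions(Map("updates" -> NewPartitions.increaseTo(12)).asJava).all().get()
    admin.close()
  }
}
```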

Moreover, you say you have Apache Spark expertise on your team, so it makes sense to keep using it.

Hope it helps a little in your decision making process.

Could you share the decision you take later and briefly explain it?

Upvotes: 0
