user1189332

Reputation: 1941

A dynamic message orchestration flow engine for Kafka

I'm trying to see what toolkits/frameworks are available to achieve the following.

  1. A toolkit where a developer configures the data flow (a series of steps) to form a data processing pipeline: a declarative approach with zero or very minimal coding.
  2. The underlying messaging infrastructure should be Kafka, i.e. the toolkit should support Kafka straight out of the box (when the right dependencies are included).
  3. Very intuitive to visualise, deploy, and debug the flows.
  4. Aggregation capabilities (group by, etc.) on streaming data.

I'm seeing Spring Cloud Data Flow as something that could (possibly) be tried out as a candidate. Is this what it is meant for (asking people using it in production)?

Are there any free/opensource alternatives too?

Upvotes: 2

Views: 437

Answers (1)

Sabby Anandan

Reputation: 5651

I will attempt to unpack a few topics in the context of Spring Cloud Data Flow (SCDF).

A toolkit where a developer configures the data flow (a series of steps) to form a data processing pipeline: a declarative approach with zero or very minimal coding.

There are ~70 data integration applications that we maintain and ship. They should cover the most common use-cases. Each of them is a Spring Cloud Stream application, and the business logic in them can work as-is with a variety of message brokers that the framework supports, including Kafka and Kafka Streams.

However, when you have a custom data processing requirement and there's no prebuilt application to address that need, you will have to build a custom source, processor, or sink style of application. If you don't want to use Java, polyglot workloads are possible as well.
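
For illustration, a minimal sketch of such a custom processor using Spring Cloud Stream's functional programming model might look like the following (the application and function names are made up for this example; the only requirement is the spring-cloud-stream dependency plus a binder, such as the Kafka one, on the classpath):

```java
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class UppercaseProcessorApplication {

    public static void main(String[] args) {
        SpringApplication.run(UppercaseProcessorApplication.class, args);
    }

    // The framework binds this function's input and output to broker
    // destinations (e.g. Kafka topics) through configuration; the
    // business logic itself contains nothing broker-specific.
    @Bean
    public Function<String, String> uppercase() {
        return String::toUpperCase;
    }
}
```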

SCDF allows you to assemble the applications into a coherent streaming data pipeline [see streams developer guide]. SCDF then orchestrates the deployment of the apps in the data pipeline to targeted platforms like Kubernetes as native resources.
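
As a rough sketch, the Java DSL flavor of that assembly could look like the following, assuming the spring-cloud-dataflow-rest-client dependency and an SCDF server reachable at a (hypothetical) local address; the stream composes two of the prebuilt apps mentioned earlier:

```java
import java.net.URI;

import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.client.dsl.Stream;

public class StreamDeployer {

    public static void main(String[] args) {
        // Client for a running SCDF server (the address is an assumption)
        DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://localhost:9393"));

        // Compose prebuilt apps into a pipeline and deploy it; SCDF
        // orchestrates the deployment to the configured platform
        Stream stream = Stream.builder(dataFlow)
                .name("ticktock")
                .definition("time | log")
                .create()
                .deploy();

        System.out.println("Stream status: " + stream.getStatus());
    }
}
```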

Because these applications are connected with one another through persistent pub/sub brokers (e.g., Kafka), SCDF also provides the primitives for CI/CD, rolling upgrades, and rolling rollbacks of the individual applications in the streaming data pipeline without causing upstream or downstream impacts. Data ordering and delivery guarantees are also preserved, because we rely on and delegate that to the underlying message broker.

The underlying messaging infrastructure should be Kafka, i.e. the toolkit should support Kafka straight out of the box (when the right dependencies are included).

This is already covered above. The point to note here, though, is that if you later want to switch from Kafka to, say, Azure Event Hubs, there's absolutely zero code change required in the business logic. A Spring Cloud Stream workload is portable, and you're not locking yourself into a single technology like Kafka.

Very intuitive to visualise, deploy, and debug the flows

SCDF supports a drag-and-drop interface, integration with observability tooling such as Prometheus + Grafana, and metrics-based auto-scaling of the applications in the data pipeline.

All of the above is also possible to accomplish directly through SCDF's REST APIs, the Java DSL (programmatic creation of data pipelines, as sketched above, which is critical for CI/CD automation), or the Shell/CLI.

Aggregation capabilities (group by, etc.) on streaming data

When using the Kafka Streams binder implementation, you can build comprehensive joins, aggregations, and stateful analytics (see the samples).
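
As a minimal sketch, a group-by-key running count with the Kafka Streams binder's functional style might look like the following (the class and function names, and the binding destinations in the comments, are made up for this example):

```java
import java.util.function.Function;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CountByKeyConfiguration {

    // Groups the inbound records by key and emits a running count per
    // key. The binder maps the function's input/output to Kafka topics
    // through properties such as
    // spring.cloud.stream.bindings.countByKey-in-0.destination=words
    // spring.cloud.stream.bindings.countByKey-out-0.destination=word-counts
    @Bean
    public Function<KStream<String, String>, KStream<String, Long>> countByKey() {
        return input -> input
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count()
                .toStream();
    }
}
```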

Upvotes: 4
