Andy Dufresne
Andy Dufresne

Reputation: 6180

Real Time Streaming With Multiple Data Sources Using Kafka

We are planning to build a real time monitoring system with apache kafka. The overall idea is to push data from multiple data sources to kafka and perform data quality checks. I have few questions with this architecture

  1. What are the best possible approaches of streaming data from multiple sources which mainly include java applications, oracle database, rest api's, log files to apache kafka? Note each client deployment includes each of such data sources. Hence the number of data sources pushing data to kafka would be equal to the number of customers * x where x are the types of data sources that I listed. Ideally a push approach would suit best instead of a pull approach. In the pull approach the target system would have to be configured with the credentials of various source system which would not be practical
  2. How do we handle failures?
  3. How do we perform data quality checks on the incoming messages? For e.g. If a certain message does not have all the required attributes, the message could be discarded and an alert could be raised for the maintenance team to check.

Kindly let me know your expert inputs. Thanks !

Upvotes: 3

Views: 1453

Answers (1)

Quentin Geff
Quentin Geff

Reputation: 829

I think the best approach here is to use Kafka connect: link but it's a pull approach : Kafka Connect sources are pull-based for a few reasons. First, although connectors should generally run continuously, making them pull-based means that the connector/Kafka Connect decides when data is actually pulled, which allows for things like pausing connectors without losing data, brief periods of unavailability as connectors are moved, etc. Second, in distributed mode the tasks that pull data may need to be rebalanced across workers, which means they won't have a consistent location or address. While in standalone mode you could guarantee a fixed network endpoint to work with (and point other services at), this doesn't work in distributed mode where tasks can be moving around between workers. Ewen

Upvotes: 1

Related Questions