Reputation: 21
My company generates around 6 million records per day, and I have seen that Hadoop is a good solution for handling large amounts of data. I found out how to load data from MySQL, but that exports the full database. Is there a way to keep data in sync between my operational MySQL DB and Hadoop?
Upvotes: 1
Views: 750
Reputation: 7842
There are multiple solutions available; which one to choose depends on your architectural requirements and deployment setup.
Debezium:
Debezium is a distributed platform deployed via Apache Kafka Connect that continuously monitors your databases and lets applications stream every row-level change in the same order it was committed to the database. It turns your existing databases into event streams, so applications can see and respond to each row-level change in the databases.
Kafka Connect is a framework and runtime for implementing and operating source connectors such as Debezium, which ingest data into Kafka, and sink connectors, which propagate data from Kafka topics into other systems.
In the case of MySQL, Debezium's MySQL connector monitors and records all row-level changes in the databases on a MySQL server. The events for each table are recorded in a separate Kafka topic, and client applications can read the Kafka topics that correspond to the database tables they are interested in and react to every row-level event they see in those topics.
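As a rough illustration, registering the Debezium MySQL connector through the Kafka Connect REST API could look like the sketch below. The hostnames, credentials and database names are placeholders, and the exact property names can differ between Debezium versions, so check the docs for the version you deploy:

    import requests

    # Hypothetical configuration for the Debezium MySQL source connector.
    # All hostnames, credentials and database names are placeholders.
    connector = {
        "name": "mysql-source",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql.example.com",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "secret",
            "database.server.id": "184054",
            "database.server.name": "opsdb",        # prefix for the change-event topics
            "database.include.list": "inventory",   # database(s) to monitor
            "database.history.kafka.bootstrap.servers": "kafka:9092",
            "database.history.kafka.topic": "schema-changes.opsdb",
        },
    }

    # Register the connector with a Kafka Connect worker (REST API, default port 8083).
    resp = requests.post("http://connect.example.com:8083/connectors", json=connector)
    resp.raise_for_status()
    print(resp.json())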
Once the data of interest is available in topics, the Kafka Connect HDFS Sink connector can be used to export the data from Kafka topics to HDFS files in a variety of formats, and it integrates with Hive when Hive integration is enabled. The connector periodically polls data from Apache Kafka and writes it to HDFS. It can also automatically create an external partitioned Hive table for each Kafka topic and update the table according to the data available in HDFS.
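A matching sketch for the sink side, again registered through the Kafka Connect REST API, might look like this (topic names, URLs and Hive settings are placeholders; property names may vary by connector version):

    import requests

    # Hypothetical configuration for the Kafka Connect HDFS Sink connector
    # (Confluent's connector class shown); all values are placeholders.
    sink = {
        "name": "hdfs-sink",
        "config": {
            "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
            "tasks.max": "2",
            "topics": "opsdb.inventory.orders",              # change-event topic(s) to export
            "hdfs.url": "hdfs://namenode.example.com:8020",
            "flush.size": "10000",                           # records written per HDFS file
            "hive.integration": "true",                      # create/update external Hive tables
            "hive.metastore.uris": "thrift://hive-metastore.example.com:9083",
            "hive.database": "ops",
            "schema.compatibility": "BACKWARD",
        },
    }

    resp = requests.post("http://connect.example.com:8083/connectors", json=sink)
    resp.raise_for_status()
    print(resp.json())

With both connectors running, Kafka Connect keeps the HDFS files (and the Hive table, if enabled) continuously updated as changes arrive from MySQL.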
Maxwell's daemon:
Maxwell's daemon is a CDC (Change Data Capture) application that reads MySQL binlogs and writes row updates as JSON to Kafka or other streaming platforms. Once the data of interest is available in Kafka topics, the Kafka Connect HDFS Sink connector can again be used to export the data from Kafka topics to HDFS files.
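To give an idea of what Maxwell's output looks like, here is a rough sketch of a Python consumer reading Maxwell's JSON row events from Kafka (the broker address is a placeholder; Maxwell writes to a topic named "maxwell" by default):

    import json
    from kafka import KafkaConsumer  # pip install kafka-python

    # Maxwell publishes one JSON document per row change; by default it writes
    # to a topic named "maxwell". The broker address here is a placeholder.
    consumer = KafkaConsumer(
        "maxwell",
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    for message in consumer:
        event = message.value
        # Typical Maxwell fields: database, table, type (insert/update/delete),
        # ts (commit timestamp) and data (the row values after the change).
        print(event["database"], event["table"], event["type"], event.get("data"))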
NiFi:
Apache NiFi automates the flow of data between systems. A NiFi CDC (Change Data Capture) flow also uses MySQL binlogs (via the CaptureChangeMySQL processor) to create a copy of a table and keep it in sync with row-level changes at the source. The output can in turn be passed to the PutHDFS processor to write the data to HDFS.
Upvotes: 1
Reputation: 1360
One of the best solutions you can use is Debezium. Debezium is built on top of the Apache Kafka Connect API and provides connectors that monitor specific databases.
It records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.
The architecture will look something like this:
MySQL --> Debezium(Kafka Connect Plugin) --> Kafka Topic --> HDFS Sink
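For reference, each message on the Kafka topic in that chain is a Debezium change event; its payload roughly follows the shape sketched below (the table and values are made up, and the exact envelope depends on your connector and converter settings):

    # Rough sketch of a Debezium change-event payload for an UPDATE on an
    # imaginary "orders" table; field names follow the standard Debezium
    # envelope, the values are invented.
    change_event = {
        "payload": {
            "before": {"id": 1001, "status": "NEW"},      # row state before the change
            "after":  {"id": 1001, "status": "SHIPPED"},  # row state after the change
            "source": {"db": "inventory", "table": "orders"},
            "op": "u",          # c = create, u = update, d = delete, r = snapshot read
            "ts_ms": 1590000000000,
        }
    }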
You can find more information and documentation about Debezium Here.
Furthermore, Apache NiFi has a processor named CaptureChangeMySQL. You can design a NiFi flow like the one below to do this:
MySQL --> CaptureChangeMySQL(Processor) --> PutHDFS(Processor)
You can read more about CaptureChangeMySQL Here.
Upvotes: 1