Venkat Ankam

Reputation: 926

Oracle to Hadoop data ingestion in real-time

I have a requirement to ingest the data from an Oracle database to Hadoop in real-time.

What's the best way to achieve this on Hadoop?

Upvotes: 5

Views: 4279

Answers (3)

ethesx

Reputation: 1379

Expanding a bit on what @Nickolay mentioned: there are a few options, though which one is "best" would be too opinion-based to state.

Tungsten (open source)

Tungsten Replicator is an open source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle, and Amazon RDS, and applied to transactional stores, including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data warehouse stores such as Vertica, Hadoop, and Amazon RDS.

Oracle GoldenGate

Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.

Dell Shareplex

SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near-real-time copy of source tables.

Upvotes: 3

Nickolay

Reputation: 32063

The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.

Other things that matter for this answer are:

  • What is the target for the data and what are you going to do with it?
    • just store plain HDFS files and access them for ad-hoc queries with something like Impala?
    • store in HBase for use in other apps?
    • use in a CEP solution like Storm?
    • ...
  • What tools is your team familiar with?
    • Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
    • or do you prefer a data integration tool like Informatica?

Coming back to CDC, there are three different approaches to it:

  • Easy: if you don't need true real-time and can identify new data with an SQL query that executes fast enough for the required data latency, then you can run that query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools); see the sketch after this list
  • Complicated: roll your own CDC solution: download the database logs, parse them into a series of inserts/updates/deletes, and ingest these into Hadoop.
  • Expensive: buy a CDC solution that does this for you (like GoldenGate or Attunity)
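
For the "Easy" approach, a minimal polling sketch could look like the following. This is only an illustration of the idea, not a hardened solution: it assumes the source table has a LAST_UPDATED timestamp column, that the cx_Oracle and hdfs Python packages are available, and that WebHDFS is enabled on the NameNode. All table, column, host, and credential names are placeholders.

    import csv
    import io
    import time

    import cx_Oracle                   # Oracle client driver
    from hdfs import InsecureClient    # WebHDFS client

    # Placeholder connection details -- replace with your own.
    oracle = cx_Oracle.connect("scott", "tiger", "dbhost:1521/ORCLPDB1")
    hdfs_client = InsecureClient("http://namenode:50070", user="hdfs")

    last_seen = "1970-01-01 00:00:00"  # persist this watermark in production

    while True:
        cur = oracle.cursor()
        # Assumes a LAST_UPDATED column you can filter on (placeholder schema).
        cur.execute(
            "SELECT id, payload, TO_CHAR(last_updated, 'YYYY-MM-DD HH24:MI:SS') "
            "FROM orders "
            "WHERE last_updated > TO_TIMESTAMP(:1, 'YYYY-MM-DD HH24:MI:SS')",
            [last_seen],
        )
        rows = cur.fetchall()
        if rows:
            buf = io.StringIO()
            csv.writer(buf).writerows(rows)
            path = "/data/orders/chunk-%d.csv" % int(time.time())
            hdfs_client.write(path, data=buf.getvalue(), encoding="utf-8")
            last_seen = max(r[2] for r in rows)  # advance the watermark
        time.sleep(60)  # poll once a minute; tune to your latency requirement

The polling interval and chunk handling are where the trade-off lives: the shorter the interval, the closer you get to real time, but the more load you put on the source database.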

Upvotes: 4

Vijay Innamuri

Reputation: 4372

Apache Sqoop is a data transfer tool for moving bulk data from any RDBMS with JDBC connectivity (Oracle included) into Hadoop HDFS. It works in batches rather than streaming changes, but scheduled incremental imports can approximate near real-time; see the sketch below.
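
A minimal sketch of such an incremental import, driven from Python, might look like this. The Sqoop flags shown are standard Sqoop 1 options; the JDBC URL, table, column, and paths are placeholders you would replace with your own.

    import subprocess

    # Incremental append import: pulls only rows whose ID is greater than
    # the last value recorded on the previous run.
    subprocess.run(
        [
            "sqoop", "import",
            "--connect", "jdbc:oracle:thin:@dbhost:1521:ORCL",
            "--username", "scott",
            "--password-file", "/user/hdfs/oracle.password",
            "--table", "ORDERS",
            "--target-dir", "/data/orders",
            "--incremental", "append",
            "--check-column", "ID",
            "--last-value", "0",
            "--num-mappers", "4",
        ],
        check=True,
    )

In practice you would define this as a saved Sqoop job (sqoop job --create ...) so that Sqoop tracks --last-value for you, and schedule it with cron or Oozie.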

Upvotes: 0
