theMJof91
theMJof91

Reputation: 126

Import Oracle Data To Hadoop

Forewarning: I am new to the data science/data engineer/hadoop field; I have been looking for solutions to import large amounts of data (millions of rows) into a Hadoop cluster we have set up. The database is Oracle, and I have been exploring many ways to run such a job; the end goal is to get as closely to a real-time import of data within the cluster.

The first job I have seen here is through the use of a shell file utilizing sqlplus. I created this sql script to run and take the tables and dates as working parameters, and then on running a shell script for the specific tables we want to see daily in our cluster. This task could then be scheduled to be automated on a nightly basis. The problem with this approach, is that it seems to be taking so slow. From my knowledge, UTL_FILE and Pro*C are similarly slow.

This leads me to the discovery of Sqoop, which we do have installed, but what I am failing to understand is WHY it is so much faster than sqlplus. With SQLPlus, I am simply selecting the table statement and spooling this into a text-delimited file (say csv). This is one process. In Sqoop, how is it doing the multiple parallel calls?

Any insights into this problem would be much appreciated!

Upvotes: 0

Views: 495

Answers (2)

Sachin Janani
Sachin Janani

Reputation: 1319

Sqoop avoids reading of same oracle block by utilizing the concept of rowid. In doing so,it ensures that no two parallel running tasks will read data from the same Oracle block. This lowers disk operations on the database server, significantly improving performance.

Upvotes: 0

HAL 9000
HAL 9000

Reputation: 3985

Your knowledge that "UTL_FILE and Pro*C are similarly slow" ... how did you learn it? What did you measure? I'm wondering why Pro*C should be slow? It's C after all.

When the import into hadoop HDFS is faster than the import into your local filesystem - and they are both on comparable disks - then it's certainly because one is doing sequential writes (appending blocks) where the other is writing random blocks...

As you want to get close to a real-time import, you'd have to maintain some kind of change tracking. The best way to do this in Oracle is by using materialized views logs.

Upvotes: 1

Related Questions