Jay

Reputation: 2454

Best practices for huge volumes of data load/unload?

My question applies to ETL scenarios where the transformation is performed completely outside the database. If you were to Extract, Transform, and Load huge volumes of data (20+ million records or more), and the databases involved are Oracle and MSSQL Server, what would be the best way to:

  1. Effectively read from the source database: Is there a way I could avoid all the querying over the network? I have heard good things about the Direct Path Extract method / bulk unload method - I'm not quite sure how they work, but I presume I would need a dump file of sorts for any kind of non-network-based data read/import?
  2. Effectively write the transformed data to the target database: Should I consider Apache Hadoop? Would it let me start my transformation and load all my data to the destination database in parallel? Would it be faster than, say, Oracle's bulk load utility? If not, is there a way to remotely invoke the bulk load utilities on Oracle/MSSQL Server?

Appreciate your thoughts/suggestions.

Upvotes: 1

Views: 3048

Answers (2)

Michael Dillon

Reputation: 32392

I would always use the db's bulk load facilities for this. Remote control of bulk loads is a sysadmin issue; there is always a way to do this.

This means that the first two stages of ETL would be an application that generates the right file format for the bulk load facility, and the last stage would be to invoke the bulk load.
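
For example, here is a minimal sketch of that last stage in Python, assuming the transform stage has already written a delimited flat file and that SQL*Loader (sqlldr) is available on a host that can reach the Oracle instance; the file names, connect string, and control file are hypothetical:

    import subprocess

    # Hypothetical paths and connect string - adjust for your environment.
    DATA_FILE = "/staging/transformed_orders.csv"   # output of the transform stage
    CONTROL_FILE = "/staging/orders.ctl"            # maps the CSV columns to table columns
    CONNECT = "etl_user/etl_password@ORCL"

    def bulk_load_oracle():
        """Invoke SQL*Loader in direct-path mode against the staged flat file."""
        cmd = [
            "sqlldr",
            f"userid={CONNECT}",
            f"control={CONTROL_FILE}",
            f"data={DATA_FILE}",
            "direct=true",   # direct-path load, bypassing conventional inserts
            "log=/staging/orders_load.log",
        ]
        # To invoke the load remotely, wrap the same command in ssh to the staging/DB host:
        # cmd = ["ssh", "etl@staging-host"] + cmd
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        bulk_load_oracle()

For SQL Server the same pattern applies with bcp (or BULK INSERT) in place of sqlldr.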

Upvotes: 1

Jason Martin

Reputation: 41

If you extract the data to a flat file, you can use Oracle External Tables to read the file into Oracle directly. At that point you can do a direct-path insert (/*+ append */) that performs the necessary transformation. Also, if the table is NOLOGGING then you will save on redo costs, but you need to take into account that the data could be lost and would need to be reloaded if a media failure happens before the next backup.
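
A rough sketch of that approach, here driven from Python via python-oracledb; the directory object, file name, table names, and column list are hypothetical, and the expression in the SELECT is just a placeholder for the real transformation:

    import oracledb  # the cx_Oracle successor; pip install oracledb

    conn = oracledb.connect(user="etl_user", password="etl_password", dsn="dbhost/ORCL")
    cur = conn.cursor()

    # External table over the extract file. DATA_DIR is a pre-created DIRECTORY
    # object pointing at the folder that holds the file (names are hypothetical).
    cur.execute("""
        CREATE TABLE orders_ext (
            order_id    NUMBER,
            customer_id NUMBER,
            amount      NUMBER
        )
        ORGANIZATION EXTERNAL (
            TYPE ORACLE_LOADER
            DEFAULT DIRECTORY data_dir
            ACCESS PARAMETERS (
                RECORDS DELIMITED BY NEWLINE
                FIELDS TERMINATED BY ','
            )
            LOCATION ('orders_extract.csv')
        )
    """)

    # Direct-path insert into the (possibly NOLOGGING) target table,
    # applying the transformation on the way through.
    cur.execute("""
        INSERT /*+ APPEND */ INTO orders_fact
        SELECT order_id, customer_id, ROUND(amount * 1.1, 2)  -- placeholder transformation
        FROM   orders_ext
    """)
    conn.commit()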

Also, if the data is time-based, consider using partitioning and partition exchanges. You can load the data via the method above into an index-less staging table, create the necessary indexes (again possibly as NOLOGGING), which is faster than maintaining the indexes during the insert, and then 'exchange' the partition into the master table.
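
Continuing the hypothetical example above (same conn and cur; the table, index, and partition names are made up), the staging-and-exchange step might look roughly like this:

    # Empty, non-partitioned staging table with the same shape as the target.
    cur.execute("CREATE TABLE orders_stage NOLOGGING AS SELECT * FROM orders_fact WHERE 1 = 0")

    # 1. Direct-path load into the index-less staging table from the external table.
    cur.execute("INSERT /*+ APPEND */ INTO orders_stage SELECT order_id, customer_id, amount FROM orders_ext")
    conn.commit()

    # 2. Build the indexes after the load - cheaper than maintaining them row by row.
    #    (They should match the local indexes on the partitioned master table.)
    cur.execute("CREATE INDEX orders_stage_ix ON orders_stage (customer_id) NOLOGGING")

    # 3. Swap the loaded segment into the day's partition of the master table.
    cur.execute("""
        ALTER TABLE orders_fact
        EXCHANGE PARTITION p_20230101 WITH TABLE orders_stage
        INCLUDING INDEXES WITHOUT VALIDATION
    """)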

Upvotes: 0
