Stormcloud

Reputation: 2307

Extracting Data from Oracle to Hadoop: Is Sqoop a good idea?

I'm looking to extract some data from an Oracle database and transfer it to a remote HDFS file system. There appear to be a couple of ways of achieving this:

  1. Use Sqoop. This tool will extract the data, copy it across the network and store it directly in HDFS (something like the command sketched below this list).
  2. Use SQL to read the data and store it on the local file system. When that has completed, copy (FTP?) the data to the Hadoop system.
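For concreteness, this is roughly what I have in mind for option 1; the JDBC URL, credentials, table and column names are placeholders rather than my actual setup:

    # Placeholder values throughout: connection string, credentials, table and split column.
    # --num-mappers controls how many parallel SELECTs hit the database;
    # --split-by names the column Sqoop uses to divide the table between mappers.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCLPDB \
      --username scott \
      --password-file /user/scott/.oracle_pw \
      --table SALES \
      --target-dir /user/scott/sales \
      --num-mappers 4 \
      --split-by SALE_ID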

My question is: will the first method (which is less work for me) cause Oracle to lock tables for longer than required?

My worry is that Sqoop might take out a lock on the database when it starts to query the data, and that this lock won't be released until all of the data has been copied across to HDFS. Since I'll be extracting a large amount of data and copying it to a remote location (so there will be significant network latency), the lock would be held for longer than would otherwise be required.

Upvotes: 2

Views: 968

Answers (1)

hadooper

Reputation: 746

  • Sqoop issues ordinary SELECT queries against the Oracle database, so it takes the same locks a plain SELECT would; Sqoop performs no additional locking of its own (see the query sketch after this list).

  • Data is transferred in several concurrent tasks (mappers). Any expensive function call in the import query puts a significant load on your database server, and advanced functions could lock certain tables, preventing Sqoop from transferring data in parallel. This will adversely affect transfer performance.

  • For efficient advanced filtering, run the filtering query inside the database prior to the import, save its output to a temporary table, and then run Sqoop to import the temporary table into Hadoop without the --where parameter (a sketch follows after this list).

  • A Sqoop import has nothing to do with copying the data across the cluster afterwards. Sqoop writes the data to HDFS once, and HDFS then replicates it according to the cluster's replication factor.
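To illustrate the first point, the statements Sqoop generates are ordinary SELECTs of roughly the following shape (table, column names and ranges here are only placeholders). A plain SELECT in Oracle does not block writers, so an import like this should not hold table locks for the duration of the copy:

    -- Boundary query, run once to work out how to split the table between mappers
    SELECT MIN(SALE_ID), MAX(SALE_ID) FROM SALES;

    -- Each mapper then runs a plain range query over its share of the key space
    SELECT SALE_ID, CUSTOMER_ID, AMOUNT
    FROM SALES
    WHERE SALE_ID >= 1 AND SALE_ID < 250001;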
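And a sketch of the temporary-table approach from the third point, again with hypothetical table, column and directory names. Run the expensive filtering once, inside the database:

    CREATE TABLE SALES_2023_EU AS
    SELECT *
    FROM SALES
    WHERE SALE_DATE >= DATE '2023-01-01'
      AND REGION = 'EU';

Then import the pre-filtered table without any --where clause, so the mappers only run simple range scans:

    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCLPDB \
      --username scott \
      --password-file /user/scott/.oracle_pw \
      --table SALES_2023_EU \
      --target-dir /user/scott/sales_2023_eu \
      --num-mappers 4 \
      --split-by SALE_ID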

Upvotes: 1
