Rups N

Reputation: 437

Hadoop MapReduce DBInputFormat and DBOutputFormat

I need to import data from MySQL, run a MapReduce job, and export the results back to MySQL. I am able to do this successfully in a single MR job for a few records using DBInputFormat and DBOutputFormat. When I scale the input to 100+ million records, the MR job hangs. The alternative is to export the data to HDFS first, run the MR job there, and push the results back to MySQL.
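Roughly, the job is wired up like this (a simplified sketch; the table, column, mapper/reducer and record class names are placeholders rather than my real schema):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

    public class MyDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // JDBC connection details for the source/target MySQL database
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://dbhost/mydb", "user", "password");

            Job job = Job.getInstance(conf, "mysql-to-mysql");
            job.setJarByClass(MyDriver.class);
            job.setMapperClass(MyMapper.class);      // my own mapper
            job.setReducerClass(MyReducer.class);    // my own reducer

            // Read rows directly from MySQL
            job.setInputFormatClass(DBInputFormat.class);
            DBInputFormat.setInput(job, EmployeeRecord.class,
                    "employees", null, "id", "id", "name");

            // Write the reduced results straight back to MySQL;
            // the reducer's output key implements DBWritable
            job.setOutputFormatClass(DBOutputFormat.class);
            DBOutputFormat.setOutput(job, "employee_summary", "id", "total");

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }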

For a huge dataset of around 400+ million records, which option is better: using DBInputFormat and DBOutputFormat, or using HDFS as the data source and destination?

Using HDFS adds a step before and after my MR job. Since the data is stored on HDFS it would be replicated (default factor 3) and will require more disk space. Thanks, Rupesh

Upvotes: 0

Views: 1807

Answers (1)

Binary01

Reputation: 695

I think the best approach in such a situation is to use Sqoop. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases like MySQL or Oracle. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. Please look into this link and explore Sqoop for details: SQOOP details

In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with. This is pretty tedious, and entirely algorithmic. Sqoop auto-generates class definitions to deserialize the data from the database. These classes can also be used to store the results in Hadoop's SequenceFile format, which allows you to take advantage of built-in compression within HDFS too. The classes are written out as .java files that you can incorporate in your own data processing pipeline later. The class definition is created by taking advantage of JDBC's ability to read metadata about databases and tables.
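For illustration, a hand-written record class for a hypothetical two-column table (id, name) looks roughly like this; Sqoop generates the equivalent code for you:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    public class EmployeeRecord implements Writable, DBWritable {
        private int id;
        private String name;

        // Populate the fields from a JDBC result set (database -> Hadoop)
        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getInt(1);
            name = rs.getString(2);
        }

        // Bind the fields to a prepared statement (Hadoop -> database)
        public void write(PreparedStatement ps) throws SQLException {
            ps.setInt(1, id);
            ps.setString(2, name);
        }

        // Plain Writable serialization used between map and reduce
        public void readFields(DataInput in) throws IOException {
            id = in.readInt();
            name = Text.readString(in);
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            Text.writeString(out, name);
        }
    }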

When Sqoop is invoked, it retrieves the table’s metadata, writes out the class definition for the columns you want to import, and launches a MapReduce job to import the table body proper.
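A typical import/export pair looks like this (the connection string, credentials, table names and HDFS directories are placeholders for your own):

    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username user --password pass \
      --table employees \
      --target-dir /data/employees \
      --num-mappers 8

    # run your MR job on /data/employees, then push the results back:
    sqoop export \
      --connect jdbc:mysql://dbhost/mydb \
      --username user --password pass \
      --table employee_summary \
      --export-dir /data/employee_summary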

Upvotes: 3
