Reputation: 50508
There are many, many ways of transferring data into a Hadoop cluster - including, for example, writing the data programmatically (say, through a client library), transferring through a JDBC connector (for example, the one used by Sqoop), going through Thrift, or using the command-line tools.
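To make one of those options concrete, a Sqoop import from a relational database into HDFS might look roughly like the sketch below; the connection string, credentials, table name, target directory, and mapper count are placeholders, not details of my actual setup.

    # Hypothetical Sqoop import: pull a table from a relational database
    # into HDFS using several parallel map tasks. All names and paths
    # below are placeholders.
    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username loader \
      --table transactions \
      --target-dir /data/transactions \
      --num-mappers 8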
How do the various data transfer options compare in terms of large-scale, raw data transfer throughput?
For context:
I'm looking to schedule an irregular process, transferring ~3TB of data into a Hadoop cluster.
There aren't many requirements, only that I transfer the data in as quickly as possible; the data transfer step is the most significant bottleneck here. The data can be transferred to anywhere on the cluster: as files on HDFS or as more structured data in HBase.
I have a choice whether to load the data from a transactional database, or from a set of CSV files sitting in a file system, and some flexibility to try other alternatives if they promise significant performance increases.
I've looked at the available options and have some gut feelings about what would work best, but I would love to see any performance test measurements, if available.
Upvotes: 2
Views: 533
Reputation: 6418
I'd say that uploading compressed CSV into HDFS with the hadoop fs command-line tool will be the fastest option. In this scenario, network bandwidth is the only factor that limits the transfer rate.
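As a rough sketch (local paths, file names, and the HDFS target directory are placeholders), the upload could be as simple as:

    # Compress the CSVs locally first; gzip is shown here for simplicity.
    gzip -k /local/csv/part-*.csv

    # Copy the compressed files into HDFS.
    hadoop fs -mkdir -p /data/raw
    hadoop fs -put /local/csv/part-*.csv.gz /data/raw/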
All the other options can only add overhead to the amount of data transferred. Some of them may add none, but executing a console command is simple, so why complicate things?
After the data has been uploaded into HDFS, it can be transformed as necessary or loaded into HBase with Pig or MapReduce. Any transformation of data already in HDFS will be faster than transforming data that resides on a local file system, since the processing will be parallelized and (most likely) will run locally on the nodes that store the corresponding data chunks.
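If HBase is the destination, one possible conversion step is the ImportTsv MapReduce job that ships with HBase; the table name, column family, and input path in the sketch below are made up for illustration.

    # Hypothetical conversion step: run HBase's bundled ImportTsv MapReduce
    # job over the CSV files already sitting in HDFS. Table name, column
    # family, and paths are placeholders.
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      -Dimporttsv.separator=',' \
      -Dimporttsv.columns=HBASE_ROW_KEY,d:col1,d:col2 \
      my_table /data/raw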
Upvotes: 2