Reputation: 50508
There are many, many ways of transferring data into a Hadoop cluster - including, for example, writing the data programmatically (say, through a client library), transferring through a JDBC connector (for example, the one used by Sqoop), going through Thrift, or using the command-line tools.
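To make one of those options concrete, a Sqoop import from a relational database into HDFS might look roughly like the sketch below; the connection string, credentials, table name, target directory, and mapper count are placeholders, not details of my actual setup.

    # Hypothetical Sqoop import: pull a table from a relational database
    # into HDFS using several parallel map tasks. All names and paths
    # below are placeholders.
    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username loader \
      --table transactions \
      --target-dir /data/transactions \
      --num-mappers 8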
How do the various data transfer options compare in terms of large-scale, raw data transfer throughput?
For context:
I'm looking to schedule an irregular process, transferring ~3TB of data into a Hadoop cluster.
There aren't many requirements, only that I transfer the data in as quickly as possible; the data transfer step is the most significant bottleneck here. The data can be transferred to anywhere on the cluster: as files on HDFS or as more structured data in HBase.
I have a choice whether to load the data from a transactional database, or from a set of CSV files sitting in a file system, and some flexibility to try other alternatives if they promise significant performance increases.
I've looked at the available options and have some gut feelings about what would work best, but I would love to see any performance test measurements, if available.
Upvotes: 2
Views: 533
Reputation: 6418
I'd say that uploading compressed CSV into HDFS with the hadoop fs command-line tool will be the fastest option. In this scenario, network bandwidth is the only factor that limits the transfer rate.
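As a rough sketch (local paths, file names, and the HDFS target directory are placeholders), the upload could be as simple as:

    # Compress the CSVs locally first; gzip is shown here for simplicity.
    gzip -k /local/csv/part-*.csv

    # Copy the compressed files into HDFS.
    hadoop fs -mkdir -p /data/raw
    hadoop fs -put /local/csv/part-*.csv.gz /data/raw/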
All the other options can only add overhead to the amount of data transferred. Some of them may add none, but executing a console command is simple, so why complicate things?
After the data has been uploaded into HDFS, it can be transformed as necessary or loaded into HBase with Pig or MapReduce. Any transformation of data already in HDFS will be faster than transforming data that resides on a local file system, since the processing will be parallelized and (most likely) will run locally on the nodes that store the corresponding data chunks.
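If HBase is the destination, one possible conversion step is the ImportTsv MapReduce job that ships with HBase; the table name, column family, and input path in the sketch below are made up for illustration.

    # Hypothetical conversion step: run HBase's bundled ImportTsv MapReduce
    # job over the CSV files already sitting in HDFS. Table name, column
    # family, and paths are placeholders.
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      -Dimporttsv.separator=',' \
      -Dimporttsv.columns=HBASE_ROW_KEY,d:col1,d:col2 \
      my_table /data/raw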
Upvotes: 2