Reputation: 512
I have a 24 GB folder in my local file system, and my task is to move that folder to HDFS. I did it two ways.
1) hdfs dfs -copyFromLocal /home/data/ /home/
This took around 15 minutes to complete.
2) Using Flume.
Here is my agent
spool_dir.sources = src-1
spool_dir.channels = channel-1
spool_dir.sinks = sink_to_hdfs
# source
spool_dir.sources.src-1.type = spooldir
spool_dir.sources.src-1.channels = channel-1
spool_dir.sources.src-1.spoolDir = /home/data/
spool_dir.sources.src-1.fileHeader = false
# HDFS sinks
spool_dir.sinks.sink_to_hdfs.type = hdfs
spool_dir.sinks.sink_to_hdfs.hdfs.fileType = DataStream
spool_dir.sinks.sink_to_hdfs.hdfs.path = hdfs://192.168.1.71/home/user/flumepush
spool_dir.sinks.sink_to_hdfs.hdfs.filePrefix = customevent
spool_dir.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
spool_dir.sinks.sink_to_hdfs.hdfs.batchSize = 1000
# file channel
spool_dir.channels.channel-1.type = file
spool_dir.channels.channel-1.checkpointDir = /home/user/spool_dir_checkpoint
spool_dir.channels.channel-1.dataDirs = /home/user/spool_dir_data
# bind the sink to the channel (the source is already bound above)
spool_dir.sinks.sink_to_hdfs.channel = channel-1
This took almost an hour to push the data to HDFS.
As far as I know, Flume is distributed, so shouldn't it load data faster than the copyFromLocal command?
Upvotes: 0
Views: 353
Reputation: 30089
If you're looking simply at read and write operations, Flume is going to be at least 2x slower with your configuration because you're using a file channel: every file read from disk is encapsulated into a Flume event (in memory) and then serialized back down to disk via the file channel. The sink then reads the event back from the file channel (disk) before pushing it up to HDFS.
You also haven't set a blob deserializer on your spoolDir source, so it reads one line at a time from your source files, wraps each line in a Flume event, and writes it to the file channel. Paired with the HDFS sink's default rollXXX values, you'll get a new file in HDFS per 10 events / 30 s / 1 KB rather than the one file per input file that you'd get with copyFromLocal.
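As a rough sketch of overriding those roll defaults on your sink (the values are illustrative assumptions, not tuned recommendations), you could roll on size only:

```properties
# Assumed values for illustration; setting a trigger to 0 disables it
spool_dir.sinks.sink_to_hdfs.hdfs.rollCount = 0
spool_dir.sinks.sink_to_hdfs.hdfs.rollInterval = 0
# roll when a file reaches about 128 MB (an assumed target size)
spool_dir.sinks.sink_to_hdfs.hdfs.rollSize = 134217728
# close a file that has received no events for 60 s, so the last one does not stay open
spool_dir.sinks.sink_to_hdfs.hdfs.idleTimeout = 60
```

This still won't give you exactly one HDFS file per input file, but it should stop the sink from creating a new file every 10 events.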
All of these factors add up to give you slower performance. If you want more comparable performance, you should use the BlobDeserializer on the spoolDir source, coupled with a memory channel (but understand that a memory channel doesn't guarantee delivery of an event if the JVM is prematurely terminated).
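Something along these lines might work as a starting point. Treat it as a sketch: the capacity numbers are my assumptions, and it assumes the module that ships the BlobDeserializer class (the morphline Solr sink) is on your agent's classpath.

```properties
# source: emit each file as one event instead of one event per line
spool_dir.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# maximum bytes per event (assumed value); larger files are split across several events
spool_dir.sources.src-1.deserializer.maxBlobLength = 100000000

# channel: in-memory instead of file (events are lost if the agent dies)
spool_dir.channels.channel-1.type = memory
# assumed sizes; transactionCapacity must be at least the sink's hdfs.batchSize (1000 here)
spool_dir.channels.channel-1.capacity = 1000
spool_dir.channels.channel-1.transactionCapacity = 1000
```

Bear in mind that the blob deserializer buffers a whole event in RAM, so the agent's heap has to be sized for your largest files (or for maxBlobLength, whichever is smaller).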
Upvotes: 1
Reputation: 580
Apache Flume is not intended for moving or copying folders from the local file system to HDFS. Flume is meant for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. (Reference: Flume User Guide)
If you want to move large files or directories, you should use hdfs dfs -copyFromLocal, as you have already mentioned.
Upvotes: 0