Eran Witkon

Reputation: 4092

How to copy synchronized-files to HDFS using flume?

I have a directory tree with two directories and synchronized files in them:

home/dirMaster/file1.txt
home/dirMaster/file2.txt
home/dirSlave/file1-slave.txt
home/dirSlave/file2-slave.txt

Based on the file names, file1-slave.txt has records corresponding to file1.txt.

I want to move them to HDFS using Flume, but based on my reading so far I have the following problems:

  1. Flume will not preserve my file names, so I lose the synchronization.
  2. Flume does not guarantee that a source file will map to a single destination file; e.g. a source file might get split into several destination files.

Is this correct? Can flume support this scenario?

Upvotes: 0

Views: 627

Answers (1)

Naga

Reputation: 1253

A Flume agent moves data from a source to a sink. It uses a channel to hold this data before rolling it into the sink.

One of Flume's sinks is the HDFS sink. The HDFS sink rolls data into HDFS based on the following criteria:

  • hdfs.rollSize
  • hdfs.rollInterval
  • hdfs.rollCount

It rolls the data based on a combination of the above parameters, and the resulting file names follow a predefined pattern. We can control the file names using sink parameters, but that pattern is the same for every file rolled by the agent; we cannot get different file path patterns from a single Flume agent.

agent.sinks.sink.hdfs.path = hdfs://<namenode>:9000/pattern

The pattern can be a static or a dynamic path.
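For reference, here is a minimal single-agent sketch showing how the rolling parameters and the single hdfs.path pattern fit together. The agent/source/channel/sink names (agent1, src1, ch1, hdfsSink), the spooling directory and the NameNode address are placeholders, not taken from the question:

# Hypothetical single-agent config; all names below are placeholders
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = hdfsSink

# Spooling-directory source: picks up completed files from a local directory
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /home/dirMaster
agent1.sources.src1.channels = ch1

# Channel buffers events between source and sink
agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

# HDFS sink: one path/prefix pattern for every file this agent rolls
agent1.sinks.hdfsSink.type                   = hdfs
agent1.sinks.hdfsSink.channel                = ch1
agent1.sinks.hdfsSink.hdfs.path              = hdfs://<namenode>:9000/flume/events/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.filePrefix        = events
agent1.sinks.hdfsSink.hdfs.fileType          = DataStream
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# Rolling criteria: whichever threshold is reached first closes the current file
agent1.sinks.hdfsSink.hdfs.rollSize     = 134217728   # bytes (128 MB); 0 disables
agent1.sinks.hdfsSink.hdfs.rollInterval = 300         # seconds; 0 disables
agent1.sinks.hdfsSink.hdfs.rollCount    = 0           # events; 0 disables

Note that every file rolled by this agent lands under the same path with the same prefix plus a timestamp, so the original names file1.txt and file1-slave.txt are lost.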

Flume also produces an arbitrary number of files, depending on the rolling criteria.

So Flume is not suitable for your requirement. Flume is best suited to streaming data ingestion.

DistCP: a distributed, parallel data-copying utility for HDFS. It is a map-only MapReduce program, and it produces n part files (one per map) in the destination directory.
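For completeness, a typical DistCP invocation looks like the following (the NameNode address and paths are placeholders); note that both the source and destination are normally paths on distributed filesystems, not local directories:

hadoop distcp hdfs://<namenode>:9000/data/source hdfs://<namenode>:9000/data/target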

So DistCP is also not suitable for your requirement.

So it is better to use hadoop fs -put to load the data into HDFS:

hadoop fs -put /home/dirMaster/ /home/dirSlave/ /home/
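This preserves the original file names, so file1.txt and file1-slave.txt stay paired. A minimal end-to-end sketch (assuming /home is the intended target directory on HDFS, as in the command above):

hadoop fs -mkdir -p /home                                # create the target directory if it does not exist
hadoop fs -put /home/dirMaster/ /home/dirSlave/ /home/   # copies both directories, names preserved
hadoop fs -ls -R /home                                   # expect /home/dirMaster/file1.txt, /home/dirSlave/file1-slave.txt, ...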

Upvotes: 1
