Reputation: 4092
I have a directory tree with two directories and synchronized files in them:
home/dirMaster/file1.txt
home/dirMaster/file2.txt
home/dirSlave/file1-slave.txt
home/dirSlave/file2-slave.txt
Based on the file names, file1-slave.txt has records corresponding to file1.txt.
I want to move them to HDFS using Flume, but based on my reading so far I have the following problems:
Is this correct? Can Flume support this scenario?
Upvotes: 0
Views: 627
Reputation: 1253
A Flume agent moves data from a source to a sink. It uses a channel to hold the data before rolling it into the sink.
One of Flume's sinks is the HDFS sink. The HDFS sink rolls data into HDFS based on the following criteria, as sketched below.
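A minimal sketch of the rolling properties on the HDFS sink (using the same agent/sink names as the path example below; the values are only examples):
# Roll a new HDFS file after 30 seconds, after 1024 bytes, or after 10 events,
# whichever comes first; setting a criterion to 0 disables it
agent.sinks.sink.type = hdfs
agent.sinks.sink.hdfs.rollInterval = 30
agent.sinks.sink.hdfs.rollSize = 1024
agent.sinks.sink.hdfs.rollCount = 10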
It rolls the data based on a combination of the above parameters, and the resulting file names follow a predefined pattern. We can also control the file names using sink parameters, but the pattern is the same for every file rolled by this agent. We cannot expect different file path patterns from a single Flume agent.
agent.sinks.sink.hdfs.path=hdfs://:9000/pattern
The pattern can be a static or a dynamic path.
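For illustration, a static path versus a dynamic path built from Flume escape sequences (the NameNode address and directory names here are assumptions):
# Static path: every rolled file lands in the same directory
agent.sinks.sink.hdfs.path = hdfs://namenode:9000/flume/events
# Dynamic path: one directory per day, expanded from the event timestamp
# (requires a timestamp header or hdfs.useLocalTimeStamp = true)
agent.sinks.sink.hdfs.path = hdfs://namenode:9000/flume/events/%Y-%m-%d
agent.sinks.sink.hdfs.filePrefix = events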
Flume also produces n number of files based on the rolling criteria.
So Flume is not suitable for your requirement; it is a better fit for streaming data ingestion.
DistCp: a distributed, parallel data-copying utility for HDFS. It is a map-only MapReduce program and will produce n number of part files (one per map) in the destination directory.
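For reference, a typical DistCp invocation looks like this (a sketch; the NameNode addresses are placeholders):
# source-nn and target-nn are placeholder NameNode hosts
hadoop distcp hdfs://source-nn:8020/home/dirMaster hdfs://target-nn:8020/home/dirMaster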
So DistCp is also not suitable for your requirement.
So it is better to use hadoop fs -put to load the data into HDFS:
hadoop fs -put /home/dirMaster/ /home/dirSlave/ /home/
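To confirm that both directories landed with their original file names, you can list the target recursively:
hadoop fs -ls -R /home/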
Upvotes: 1