Reputation: 719
I want to use Flume to transfer data from an HDFS directory into another directory in HDFS, and during this transfer I want to apply Morphline processing.
For example: my source is
"hdfs://localhost:8020/user/flume/data"
and my sink is
"hdfs://localhost:8020/user/morphline/"
Is this possible with Flume?
If so, what source type should I use?
Upvotes: 1
Views: 3853
Reputation: 1203
Another option is to connect a netcat source to the same sink and just cat
the files into it...
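A minimal sketch of that idea, with the agent name, channel, and port all assumed here rather than taken from your setup:

    # Hypothetical agent "a1": netcat source turning each line received
    # on the socket into one Flume event
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

You would then push the files in from the shell with something like hadoop fs -cat /user/flume/data/* | nc localhost 44444.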
Upvotes: 0
Reputation: 3798
As far as I know, there is no source for reading HDFS data. The main reason is that Flume is intended for moving large amounts of data that is in some way sent to the agent. As stated in the documentation:
"A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift Rpc Client or Thrift clients written in any language generated from the Flume thrift protocol."
All the available sources are listed on the official web page.
That being said, you will need some process in charge of reading the input HDFS file and sending it to one of the available sources. The ExecSource is probably suitable for your needs, since it lets you specify a command that will be run in order to produce the input data. Such a command could be hadoop fs -cat /hdfs/path/to/input/data or something like that.
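A minimal sketch of such an agent, assuming your paths from the question (the agent name, channel sizing, and rollover settings are placeholders, and note the exec source gives no delivery guarantees and re-reads everything on restart):

    # Hypothetical agent "a1": exec source reads HDFS data, memory channel, HDFS sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Exec source: run a command and turn each output line into a Flume event
    a1.sources.r1.type = exec
    a1.sources.r1.command = hadoop fs -cat /user/flume/data/*
    a1.sources.r1.channels = c1

    # In-memory channel buffering events between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # HDFS sink writing plain text files into the target directory
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://localhost:8020/user/morphline/
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.channel = c1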
Nevertheless, thinking about the processing you want to do, I guess you will need a custom sink in order to achieve it. I mean, the source part is just for reading the data and putting it into the Flume channel in the form of Flume events. Then, a sink or sinks will consume such events, processing them and generating the appropriate output.
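Before writing a custom sink, it may be worth checking the Morphline interceptor that ships with Flume's morphline Solr sink module: attached to the source, it runs a morphline over every event before it reaches the channel. A sketch, where the morphline file path and id are assumptions for illustration:

    # Hypothetical Morphline interceptor attached to source r1
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
    a1.sources.r1.interceptors.i1.morphlineFile = /etc/flume/conf/morphline.conf
    a1.sources.r1.interceptors.i1.morphlineId = morphline1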
Upvotes: 5