Reputation: 1809
Summary: I have a multiplexing scenario and would like to know how to multiplex dynamically: not based on a statically configured value, but based on the variable value of a field (e.g. a date).
Details: I have an input that is partitioned by an EntityId. Since I know the entities I am working with, I can configure them with Flume's usual multiplexing channel selector:
# ... (further channels elided from the list below)
agent.sources.jmsSource.channels = chan-10 chan-11
agent.sources.jmsSource.selector.type = multiplexing
agent.sources.jmsSource.selector.header = EntityId
agent.sources.jmsSource.selector.mapping.10 = chan-10
agent.sources.jmsSource.selector.mapping.11 = chan-11
# ...
Each of the channels goes to a separate HDFSEventSink, "hdfsSink-n":
agent.sinks.hdfsSink-10.channel = chan-10
agent.sinks.hdfsSink-10.hdfs.path = hdfs://some/path/
agent.sinks.hdfsSink-10.hdfs.filePrefix = entity10
# ...
agent.sinks.hdfsSink-11.channel = chan-11
agent.sinks.hdfsSink-11.hdfs.path = hdfs://some/path/
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11
# ...
This generates one file per entity, which is fine. Now I want to introduce a second variable, which is dynamic: a date. Depending on the event date, I want to create files per entity and per date. Because the date is a dynamic value, I cannot preconfigure a fixed number of sinks, each writing to its own file. Also, each sink can only be configured with a single HDFS output.
So what I need is something like a "Multiple Outputs HDFSEventSink" (similar to Hadoop's MultipleOutputs library). Is there such functionality in Flume?
If not, is there an elegant way to solve this or work around it? Another option is to modify HDFSEventSink; it seems this could be implemented by building a different "realName" (String) for each event.
Upvotes: 1
Views: 429
Reputation: 54
Actually, you can use a variable in your HDFS sink's path or filePrefix. For example, if the event headers contain a key named "date", you can configure the sink like this:
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11-%{date}
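Since the %{...} escape is resolved per event, the same idea might also let a single sink cover every entity and every date. The following is only a sketch under assumptions: the sink name hdfsSink-all and the channel chan-all are hypothetical, and each event is assumed to carry both the EntityId header from the question and a "date" header (for example "2014-01-31") set upstream.

```properties
# Hypothetical single sink replacing hdfsSink-10, hdfsSink-11, ...
# Assumes every event has "EntityId" and "date" headers.
agent.sinks.hdfsSink-all.type = hdfs
agent.sinks.hdfsSink-all.channel = chan-all
agent.sinks.hdfsSink-all.hdfs.path = hdfs://some/path/
# Header values are substituted per event, giving one file series
# per entity and per date, e.g. a prefix like entity10-2014-01-31
agent.sinks.hdfsSink-all.hdfs.filePrefix = entity%{EntityId}-%{date}
```

If you keep the per-entity sinks from the question instead, adding -%{date} to each existing filePrefix, as in the line above, should be enough.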
Upvotes: 1