Reputation: 1960
I'm using Flume to collect tweets and store them on HDFS. The collecting part is working fine, and I can find all my tweets in my file system.
Now I would like to extract all these tweets into a single file.
The problem is that the different tweets are stored as follows:
As we can see, the tweets are stored inside blocks of 128 MB but only use a few KB each, which is normal behaviour for HDFS (correct me if I'm wrong).
However, how can I get all the different tweets into one file?
Here is my conf file, which I run with the following command:
flume-ng agent -n TwitterAgent -f ./my-flume-files/twitter-stream-tvseries.conf
twitter-stream-tvseries.conf :
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = hidden
TwitterAgent.sources.Twitter.consumerSecret = hidden
TwitterAgent.sources.Twitter.accessToken = hidden
TwitterAgent.sources.Twitter.accessTokenSecret = hidden
TwitterAgent.sources.Twitter.keywords = GoT, GameofThrones

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://ip-address:8020/user/root/data/twitter/tvseries/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
Upvotes: 2
Views: 538
Reputation:
You can use the following command to concatenate the files into a single file:
find . -type f -name 'FlumeData*' -exec cat {} + >> output.file
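Note that find only sees a local filesystem, so the command above works on a local copy (or a FUSE mount) of the data. Since your files live directly in HDFS, a sketch using Hadoop's built-in getmerge (assuming the hdfs.path from your config) does the same in one step:

# Merge every file in the sink directory into one file on the local filesystem
hdfs dfs -getmerge /user/root/data/twitter/tvseries/tweets tweets.merged

# Optionally push the merged file back into HDFS
hdfs dfs -put tweets.merged /user/root/data/twitter/tvseries/tweets.merged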
Or, if you want to store the data in Hive tables for later analysis, create an external table over the directory and query it from Hive.
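A minimal sketch of such a table, assuming the events landed as one JSON document per line; the table name, columns, and SerDe below are illustrative assumptions (and note that the stock org.apache.flume.source.twitter.TwitterSource writes Avro, in which case you would use the Avro SerDe and a matching schema instead):

-- Hypothetical external table over the Flume sink directory;
-- assumes one JSON tweet per line and the HCatalog JSON SerDe.
CREATE EXTERNAL TABLE tweets_raw (
  id BIGINT,
  created_at STRING,
  text STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/root/data/twitter/tvseries/tweets';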
Upvotes: 0
Reputation: 117
You can configure the HDFS sink to roll files by time, by number of events, or by size. So, if you want to keep appending tweets to one file until a 120 MB limit is reached, set:
# Disable time-based rolling (a non-zero value rolls a new file every N seconds)
hdfs.rollInterval = 0
# Roll a new file once it reaches ~120 MB (just under the 128 MB block size)
hdfs.rollSize = 125829120
# Disable event-count-based rolling (a non-zero value rolls after N events, i.e. tweets in your case)
hdfs.rollCount = 0
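Applied to the agent from the question, those keys would carry the full sink prefix:

TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.rollSize = 125829120
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0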
Upvotes: 0