Reputation: 1960
I'm using Flume to collect tweets and store them on HDFS. The collecting part is working fine, and I can find all my tweets in my file system.
Now I would like to extract all these tweets into a single file.
The problem is that the different tweets are stored as follows:
As we can see, the tweets are stored inside blocks of 128 MB but only use a few KB each, which is normal behaviour for HDFS (correct me if I'm wrong).
However, how can I get all the different tweets into one file?
Here is my conf file, which I run with the following command:
flume-ng agent -n TwitterAgent -f ./my-flume-files/twitter-stream-tvseries.conf
twitter-stream-tvseries.conf :
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = hidden
TwitterAgent.sources.Twitter.consumerSecret = hidden
TwitterAgent.sources.Twitter.accessToken = hidden
TwitterAgent.sources.Twitter.accessTokenSecret = hidden
TwitterAgent.sources.Twitter.keywords = GoT, GameofThrones

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://ip-address:8020/user/root/data/twitter/tvseries/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
Upvotes: 2
Views: 538
Reputation:
You can use the following command to concatenate the files into a single file:
find . -type f -name 'FlumeData*' -exec cat {} + >> output.file
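Note that find only sees a local filesystem, so the command above works on a local copy (or a FUSE mount) of the data. Since your files live directly in HDFS, a sketch using Hadoop's built-in getmerge (assuming the hdfs.path from your config) does the same in one step:

# Merge every file in the sink directory into one file on the local filesystem
hdfs dfs -getmerge /user/root/data/twitter/tvseries/tweets tweets.merged

# Optionally push the merged file back into HDFS
hdfs dfs -put tweets.merged /user/root/data/twitter/tvseries/tweets.merged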
Or, if you want to store the data in Hive tables for later analysis, create an external table over the directory and query it from Hive.
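A minimal sketch of such a table, assuming the events landed as one JSON document per line; the table name, columns, and SerDe below are illustrative assumptions (and note that the stock org.apache.flume.source.twitter.TwitterSource writes Avro, in which case you would use the Avro SerDe and a matching schema instead):

-- Hypothetical external table over the Flume sink directory;
-- assumes one JSON tweet per line and the HCatalog JSON SerDe.
CREATE EXTERNAL TABLE tweets_raw (
  id BIGINT,
  created_at STRING,
  text STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/root/data/twitter/tvseries/tweets';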
Upvotes: 0
Reputation: 117
You can configure the HDFS sink to roll files by time, by number of events, or by size. So, if you want to keep appending tweets to one file until a 120 MB limit is reached, set:
# Disable time-based rolling (a non-zero value rolls a new file every N seconds)
hdfs.rollInterval = 0
# Roll a new file once it reaches ~120 MB (just under the 128 MB block size)
hdfs.rollSize = 125829120
# Disable event-count-based rolling (a non-zero value rolls after N events, i.e. tweets in your case)
hdfs.rollCount = 0
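Applied to the agent from the question, those keys would carry the full sink prefix:

TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.rollSize = 125829120
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0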
Upvotes: 0