John Simon

Reputation: 826

Retrieving data from twitter using flume and storing to hdfs in JSON FORMAT

I am trying to retrieve data from Twitter using Flume and store it in HDFS in JSON format. The data is loading into HDFS, but not in JSON format.

Here are a few lines from the file that Flume wrote to HDFS:

Objavro.schema\E4
{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}\00\E0D\C9H\B8$\DCb,C\8A5y\D1n\CE$733267766577356800\00\96\00Zumaran \00\C6C.A.B//C.A.H
Wsp:351 220-1251
Fb:Ramiro Pedernera✌
Insta:Ramiropedernera
Snapp:ramipedernera12\00\B2\9E\00\B2(\00(DIVI^Lista RAMIRO P.\00RamiPedernera12\00(2016-05-19T17:37:13Z\00tGaray culiadaso me metió una patada en la frente 😠😠\00\00\00\00\00\00\A8<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>\00\E0D\C9H\B8$\DCb,C\8A5y\D1n
Objavro.schema\E4

Since this is not in JSON format, it is not possible to process it by creating a table in Hive and loading this data. Please help me load the Twitter data into HDFS in JSON format.

This is the command I used :

bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

And the twitter.conf is attached:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =********
TwitterAgent.sources.Twitter.consumerSecret =*************
TwitterAgent.sources.Twitter.accessToken =****************
TwitterAgent.sources.Twitter.accessTokenSecret =*****************
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser_/twitter-cool
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = json
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
TwitterAgent.sources.Twitter.handler = org.apache.flume.source.http.JSONHandler

Upvotes: 2

Views: 4187

Answers (2)

Farooque

Reputation: 3766

To change from Avro to JSON format, you have to follow a few steps:

In your config file change the property

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

to

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

com.cloudera.flume.source.TwitterSource is a custom class that writes each record to HDFS in JSON format.
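For illustration, here is a sketch of how the question's twitter.conf might look after this change. Only the source type comes from this answer. Two other differences are assumptions on my part: hdfs.writeFormat is set to Text (the Flume HDFS sink documents Text and Writable as its values, not json), and the handler line is dropped because handler is a property of Flume's HTTP source rather than of a Twitter source. Credentials stay masked and the keyword list is shortened.

```properties
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Custom source from cdh-twitter-example; its JAR must be on the Flume classpath.
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = ********
TwitterAgent.sources.Twitter.consumerSecret = ********
TwitterAgent.sources.Twitter.accessToken = ********
TwitterAgent.sources.Twitter.accessTokenSecret = ********
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser_/twitter-cool
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
# Assumption: Text replaces the question's "json" value.
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
```

The agent can then be started with the same flume-ng command as in the question.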

To get this class, go to https://github.com/cloudera/cdh-twitter-example, download the flume-sources folder to your local machine, and build the JAR file from it.

  1. To build the flume-sources JAR:

    $ cd flume-sources
    $ mvn package
    $ cd ..

This will generate a file called flume-sources-1.0-SNAPSHOT.jar in the target directory.

  2. Add the JAR to the Flume classpath

Copy flume-sources-1.0-SNAPSHOT.jar to /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar and also to /var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar

If those directories do not exist, create them with:

sudo mkdir -p /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/

sudo mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib/

For more details, please refer to Analyzing Twitter Data Using CDH.

Hope this helps!

Upvotes: 2

puma91

Reputation: 386

The events from Flume's TwitterSource are in Avro format by default. To change that, you would have to modify the source code of TwitterSource so that it emits the tweets in raw (JSON) format. Fortunately, Cloudera has already done that here: https://github.com/cloudera/cdh-twitter-example
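If you want to confirm which format a given output file is in, one possible check is to look at its first four bytes: Avro container files begin with the bytes "Obj" followed by 0x01, which is why the dump in the question starts with "Objavro.schema". The function below is a small sketch; is_avro_container is a name made up for this example, and it works on a local copy of the file.

```shell
# Succeeds (exit status 0) if the local file starts with the Avro
# container magic bytes: 'O' 'b' 'j' 0x01.
is_avro_container() {
  # Dump the first 4 bytes as hex and strip spaces and newlines.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "4f626a01" ]
}
```

For example, after copying one file out of HDFS with hdfs dfs -get (the file name is whatever Flume wrote in your output directory), run is_avro_container on the local copy: success means the file is still an Avro container rather than plain JSON text.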

All you have to do is install the libraries for the new TwitterSource by following the steps in the README, and change TwitterAgent.sources.Twitter.type in the Flume config file to com.cloudera.flume.source.TwitterSource. There is an example config file in the same project.

Hope it helps.

Upvotes: 0
