Reputation: 826
I am trying to retrieve data from Twitter using Flume and store it in HDFS in JSON format. The data is being loaded into HDFS, but not in JSON format.
Here are a few lines from the HDFS file that was written from Twitter:
Objavro.schema\E4
{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}\00\E0D\C9H\B8$\DCb,C\8A5y\D1n\CE$733267766577356800\00\96\00Zumaran \00\C6C.A.B//C.A.H
Wsp:351 220-1251
Fb:Ramiro Pedernera✌
Insta:Ramiropedernera
Snapp:ramipedernera12\00\B2\9E\00\B2(\00(DIVI^Lista RAMIRO P.\00RamiPedernera12\00(2016-05-19T17:37:13Z\00tGaray culiadaso me metió una patada en la frente 😠😠\00\00\00\00\00\00\A8<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>\00\E0D\C9H\B8$\DCb,C\8A5y\D1n
Objavro.schema\E4
Since this is not JSON, I cannot process it by creating a table in Hive and loading the data. Please help me load the Twitter data into HDFS in JSON format.
This is the command I used:
bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
And this is my twitter.conf:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =********
TwitterAgent.sources.Twitter.consumerSecret =*************
TwitterAgent.sources.Twitter.accessToken =****************
TwitterAgent.sources.Twitter.accessTokenSecret =*****************
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser_/twitter-cool
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = json
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
TwitterAgent.sources.Twitter.handler = org.apache.flume.source.http.JSONHandler
Upvotes: 2
Views: 4187
Reputation: 3766
To switch from Avro to JSON output, follow these steps:
In your config file, change the property
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
to
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
com.cloudera.flume.source.TwitterSource is a custom class that writes each record to HDFS in JSON format.
To get this class, go to https://github.com/cloudera/cdh-twitter-example, download the flume-sources folder to your local machine, and build the JAR from it.
To build the flume-sources JAR:
$ cd flume-sources
$ mvn package
$ cd ..
This will generate a file called flume-sources-1.0-SNAPSHOT.jar in the target directory.
Copy flume-sources-1.0-SNAPSHOT.jar
to /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
and also to /var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
If those directories do not exist, create them with:
sudo mkdir -p /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/
sudo mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
For more details, please refer to Analyzing Twitter Data Using CDH.
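After restarting the agent with the new source, it can be worth checking that the files really contain JSON before pointing Hive at them. The following is only a rough sketch: it assumes the sink writes one JSON object per line (what I would expect with hdfs.fileType = DataStream, but not something confirmed above), and the function name is made up for illustration.

```python
import json


def count_json_records(lines):
    """Return (ok, bad): how many non-empty lines parse as a JSON object and how many do not.

    Assumption: one tweet per line, each serialized as a JSON object.
    """
    ok = bad = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad += 1
            continue
        if isinstance(record, dict):
            ok += 1
        else:
            bad += 1  # valid JSON, but not an object (e.g. a bare number)
    return ok, bad
```

To use it, copy one of the output files to local disk (for example with hdfs dfs -get), open it in text mode and pass the file object to count_json_records. A non-zero second value means some lines are not JSON objects.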
Hope this helps!
Upvotes: 2
Reputation: 386
Events from Flume's TwitterSource are in Avro format by default. To change that, you would have to modify the TwitterSource source files to get the tweets in raw format (JSON). Fortunately, Cloudera has already done that here: https://github.com/cloudera/cdh-twitter-example
All you have to do is install the libraries for the new TwitterSource by following the steps in the README, and change TwitterAgent.sources.Twitter.type in the Flume config file to com.cloudera.flume.source.TwitterSource. There is an example config file in the same project.
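As a side note, you can confirm that a file is an Avro container rather than JSON by looking at its first bytes: Avro object container files begin with the four bytes O, b, j, 0x01, which is consistent with the dump in the question starting with Obj followed by avro.schema. A minimal sketch (the function name is just illustrative):

```python
AVRO_MAGIC = b"Obj\x01"  # leading four bytes of an Avro object container file


def looks_like_avro_container(head: bytes) -> bool:
    """Return True if the given leading bytes start with the Avro container magic."""
    return head.startswith(AVRO_MAGIC)
```

You would feed it the first few bytes of a file copied out of HDFS, e.g. open(path, "rb").read(4), where path is wherever you saved the local copy. If it returns True, the agent is still writing Avro.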
Hope it helps
Upvotes: 0