Stuti Varshney

Reputation: 23

Query github data using Hadoop

I am trying to query GitHub data provided by the GHTorrent API using Hadoop. How can I ingest this much data (4-5 TB) into HDFS? Also, their databases are updated in real time. Is it possible to process real-time data in Hadoop using tools such as Pig, Hive, or HBase?

Upvotes: 0

Views: 93

Answers (1)

OneUser

Reputation: 187

Go through this presentation. It describes how to connect to their MySQL or MongoDB instance and fetch data: you share your public key, they add it to their repository, and then you can SSH in. Alternatively, you can download their periodic dumps from this link.
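For the bulk-ingestion part, one way to avoid staging a 4-5 TB dump on local disk is to stream the download straight into HDFS via `hdfs dfs -put -`, which reads from stdin. A minimal sketch (the dump URL and HDFS path below are placeholders, not actual GHTorrent file names):

```python
# Sketch: stream a GHTorrent dump file directly into HDFS without
# writing it to local disk first. DUMP_URL is a placeholder -- pick a
# real file from the GHTorrent downloads page.
import subprocess
import urllib.request

DUMP_URL = "https://example.org/ghtorrent-dump.tar.gz"  # placeholder URL

def hdfs_put_cmd(hdfs_dest):
    # "hdfs dfs -put - <dest>" reads file contents from stdin, so the
    # download can be piped straight into the cluster.
    return ["hdfs", "dfs", "-put", "-", hdfs_dest]

def ingest(url, hdfs_dest, chunk=8 * 1024 * 1024):
    # Download in 8 MB chunks and feed each chunk to the hdfs process.
    with urllib.request.urlopen(url) as resp:
        put = subprocess.Popen(hdfs_put_cmd(hdfs_dest), stdin=subprocess.PIPE)
        while True:
            block = resp.read(chunk)
            if not block:
                break
            put.stdin.write(block)
        put.stdin.close()
        if put.wait() != 0:
            raise RuntimeError("hdfs put failed")

if __name__ == "__main__":
    ingest(DUMP_URL, "/data/ghtorrent/dump.tar.gz")
```

For multiple dump files you would run this per file, or parallelize with a tool like DistCp once the files are reachable from the cluster.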


For processing real-time data, you cannot use Pig or Hive; those are batch-processing tools. Consider using Apache Spark.
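As a rough illustration of the Spark route, a Structured Streaming job can watch a directory where new GHTorrent event files land in HDFS and maintain running aggregates. The path and the three-column schema below are assumptions for the sketch, not the actual GHTorrent layout; the pyspark import is guarded so the snippet can be read without Spark installed.

```python
# Sketch: Spark Structured Streaming over a directory of JSON event
# files in HDFS, keeping a running count per repository and event type.
try:
    from pyspark.sql import SparkSession
    HAVE_SPARK = True
except ImportError:  # pyspark not installed locally
    SparkSession = None
    HAVE_SPARK = False

def build_stream(spark, events_dir):
    # Every new JSON file appearing in events_dir becomes a micro-batch.
    # The schema here is a guess at a minimal event shape.
    events = (spark.readStream
                   .schema("repo STRING, type STRING, created_at TIMESTAMP")
                   .json(events_dir))
    return events.groupBy("repo", "type").count()

if HAVE_SPARK and __name__ == "__main__":
    spark = SparkSession.builder.appName("ghtorrent-stream").getOrCreate()
    counts = build_stream(spark, "hdfs:///data/ghtorrent/events/")
    # Print updated counts to the console after each micro-batch.
    (counts.writeStream
           .outputMode("complete")
           .format("console")
           .start()
           .awaitTermination())
```

In practice you would swap the console sink for a durable one (e.g. Parquet files or Kafka) and add a checkpoint location so the job can recover after failures.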

Upvotes: 1

Related Questions