Krishna

Reputation: 13

Export BigQuery data into an in-house Hadoop cluster

We have GA data in BigQuery, and some of my users want to join it to in-house data in Hadoop, which we cannot move to BigQuery.

Please let me know the best way to do this.

Upvotes: 0

Views: 1791

Answers (2)

activelearner

Reputation: 7795

You could follow the route of the Hadoop connector, as Felipe Hoffa suggested, or build your own application that transfers data from BigQuery to your Hadoop cluster. Either way, you will be able to make the required joins on the Hadoop cluster using Pig, Hive, etc.

In case you want to try the application method, let me take you through a process flow your application may need to follow (a command-line sketch of these steps appears after the list):

  1. Query the BQ tables (flatten any nested or repeated fields in the query).
  2. If the query response is too large, divert it into a destination table. A destination table is simply another table in BigQuery.
  3. Export the destination table to a GCS bucket. This is a separate export job; you can choose an export format and compression type, and split the data across multiple files.
  4. From the GCS bucket, copy the files to your cluster gateway machine using a tool called gsutil.
  5. From the gateway machine, copy the data into an HDFS directory with the 'hadoop fs -copyFromLocal' command.
  6. Once the data is in an HDFS directory, create a Hive external table pointing to that directory. The data is then available in the Hive table, ready to be joined with the in-house data on your cluster.
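For concreteness, here is a hedged command-line sketch of steps 2 through 6. Every project, dataset, table, bucket, and path name is a placeholder, and the exact bq and gsutil flags should be checked against the versions you have installed:

    # Steps 1-2: run the (flattened) query and divert the large response
    # into a destination table in BigQuery.
    bq query --destination_table=mydataset.ga_flat --allow_large_results \
        'SELECT fullVisitorId, visitId FROM [my-project:ga_dataset.ga_sessions_20150101]'

    # Step 3: export the destination table to a GCS bucket as sharded,
    # compressed CSV (the * wildcard lets BigQuery split into multiple files).
    bq extract --destination_format=CSV --compression=GZIP \
        mydataset.ga_flat 'gs://my-bucket/ga_flat/part-*.csv.gz'

    # Step 4: copy the files from GCS to the cluster gateway machine.
    gsutil -m cp 'gs://my-bucket/ga_flat/part-*.csv.gz' /data/staging/ga_flat/

    # Step 5: push the files from the gateway machine into HDFS.
    hadoop fs -mkdir -p /user/me/ga_flat
    hadoop fs -copyFromLocal /data/staging/ga_flat/part-*.csv.gz /user/me/ga_flat/

    # Step 6: create a Hive external table over the HDFS directory
    # (the two columns are illustrative; match them to your exported schema).
    hive -e "
      CREATE EXTERNAL TABLE ga_flat (fullVisitorId STRING, visitId INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/user/me/ga_flat';"

Hive reads the gzipped text files transparently, so no decompression step is needed before the join.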

Let me know if you need any more details or clarifications. I went down this route because I found the connector alternative a little too complex, but that is a subjective opinion that varies from person to person.

Upvotes: 1

Felipe Hoffa

Reputation: 59325

See BigQuery to Hadoop Cluster - How to transfer data?:

The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop:

https://cloud.google.com/hadoop/bigquery-connector

This connector defines a BigQueryInputFormat class, which you use to:

  • Write a query to select the appropriate BigQuery objects.
  • Split the results of the query evenly among the Hadoop nodes.
  • Parse the splits into Java objects to pass to the mapper; the Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.

(It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes.)
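To make that concrete, here is a minimal sketch of a map-only job built on this class. The project, table, bucket, and output path are placeholders, and the class and configuration-key names should be verified against the connector version you install (they have changed between releases):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration;
    import com.google.cloud.hadoop.io.bigquery.BigQueryInputFormat;
    import com.google.gson.JsonObject;

    public class BigQueryToHdfs {

      // Each input record arrives as a gson JsonObject, one per BigQuery row,
      // as described above.
      public static class RowMapper
          extends Mapper<LongWritable, JsonObject, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, JsonObject row, Context context)
            throws IOException, InterruptedException {
          // A real job would pull out specific fields; this one just writes
          // the raw JSON row to HDFS so Hive or Pig can pick it up later.
          context.write(new Text(row.toString()), NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder project and table names.
        conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project");
        // The connector stages exported data in GCS (see the note above),
        // so it needs a bucket for its intermediate files.
        conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-temp-bucket");
        BigQueryConfiguration.configureBigQueryInput(conf,
            "my-project:ga_dataset.ga_sessions");

        Job job = Job.getInstance(conf, "bigquery-to-hdfs");
        job.setJarByClass(BigQueryToHdfs.class);
        job.setInputFormatClass(BigQueryInputFormat.class);
        job.setMapperClass(RowMapper.class);
        job.setNumReduceTasks(0); // map-only copy into HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/me/bq_export"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The result is one JSON line per BigQuery row in HDFS, which a Hive external table (as in the other answer) can then expose for the join.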

Upvotes: 1
