Sam King

Reputation: 2188

How to register S3 Parquet files in a Hive Metastore using Spark on EMR

I am using Amazon Elastic MapReduce (EMR) 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1.

Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names registered in the Hive Metastore (e.g., looking up the foo table rather than the parquet.`s3://bucket/key/prefix/foo/parquet` style of doing things). I also want this data to persist for the lifetime of the Hive Metastore (a separate RDS instance) even if I tear down the EMR cluster and spin up a new one connected to the same Metastore.
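For illustration, the two query styles look like this (table and path names are the placeholders from above):

# Querying by a name registered in the Hive Metastore:
sqlContext.sql("SELECT * FROM foo")

# Versus addressing the Parquet files directly by S3 path:
sqlContext.sql("SELECT * FROM parquet.`s3://bucket/key/prefix/foo/parquet`")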

Problem: if I do something like df.write.saveAsTable("foo"), that will, by default, create a managed table in the Hive Metastore (see https://spark.apache.org/docs/latest/sql-programming-guide.html). Managed tables copy the data from S3 to HDFS on the EMR cluster, which means the metadata would be useless after tearing down the EMR cluster.

Upvotes: 7

Views: 8738

Answers (3)

Ho Thuan

Reputation: 51

The way I solved this problem: first, create the Hive table from Spark:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Table schema for the Parquet data.
schema = StructType([
    StructField("key", IntegerType(), True),
    StructField("value", StringType(), True),
])

# Register an external table over the S3 location (the default data source is Parquet).
df = spark.catalog.createTable("data1", path="s3n://XXXX-Bucket/data1", schema=schema)
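(Note: spark.catalog is a Spark 2.x API; createTable appeared in Spark 2.2, and earlier 2.x releases use spark.catalog.createExternalTable. On the Spark 1.6.1 from the question, the equivalent is sqlContext.createExternalTable, as in the accepted answer below.)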

The table created from Spark (data1 in this case) will then be visible in Hive.

In addition, from another Hive engine you can link to the same S3 data with a CREATE EXTERNAL TABLE statement using the same schema as the table created in Spark:

CREATE EXTERNAL TABLE data1 (key INT, value STRING) STORED AS PARQUET LOCATION 's3n://XXXX-Bucket/data1'

Upvotes: 5

Sandip Sinha

Reputation: 49

You don't need EMR for this. Just fire up Athena and create a table to read the data in Parquet format. This is a much less expensive option than EMR, and more sustainable. You can use JDBC to access this data via Athena in real time.
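For reference, a minimal sketch of driving that table creation from Python with boto3 (the region, database, result location, and schema below are placeholder assumptions, not from the question):

import boto3

# All names here (region, database, buckets, columns) are placeholders.
athena = boto3.client("athena", region_name="us-east-1")

# Run the same CREATE EXTERNAL TABLE DDL through Athena's query API.
athena.start_query_execution(
    QueryString="""
        CREATE EXTERNAL TABLE IF NOT EXISTS foo (key INT, value STRING)
        STORED AS PARQUET
        LOCATION 's3://bucket/key/prefix/foo/parquet'
    """,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)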

Upvotes: 0

Sam King

Reputation: 2188

The solution was to register the S3 file as an external table.

sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")

I haven't figured out how to save a file to S3 and register it as an external table all in one shot, but createExternalTable doesn't add too much overhead.
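For reference, a minimal sketch of the resulting two-step flow on Spark 1.6 (bucket and prefix are placeholders; sqlContext is a HiveContext backed by the external Metastore):

# Step 1: write the DataFrame as Parquet to S3.
df.write.parquet("s3://bucket/key/prefix/foo/parquet")

# Step 2: register the S3 location as an external table; the schema is
# inferred from the Parquet files, and only metadata lives in the Metastore.
sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")

# The table survives cluster teardown and can be queried by name.
sqlContext.sql("SELECT COUNT(*) FROM foo").show()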

Upvotes: 5
