Reputation: 2188
I am using Amazon Elastic MapReduce (EMR) 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1.
Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names registered in the Hive Metastore (e.g., looking up the foo table rather than using the parquet.`s3://bucket/key/prefix/foo/parquet` style of path reference). I also want this data to persist for the lifetime of the Hive Metastore (a separate RDS instance) even if I tear down the EMR cluster and spin up a new one connected to the same Metastore.
Problem: if I do something like df.write.saveAsTable("foo"), that will, by default, create a managed table in the Hive Metastore (see https://spark.apache.org/docs/latest/sql-programming-guide.html). A managed table copies the data from S3 into HDFS on the EMR cluster, which means the metadata would be useless after tearing down the EMR cluster.
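For reference, a quick way to see which kind of table you ended up with is to inspect its metadata; a minimal sketch, assuming a HiveContext bound to sqlContext and an existing table named foo:

# "Table Type:" reads MANAGED_TABLE for the default saveAsTable behavior
# and EXTERNAL_TABLE for the approaches in the answers below
for row in sqlContext.sql("DESCRIBE FORMATTED foo").collect():
    print(row)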
Upvotes: 7
Views: 8738
Reputation: 51
The way I solved this problem: first, create the Hive table from Spark:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("key", IntegerType(), True),
    StructField("value", StringType(), True),
])

# registers a table in the Metastore backed by the S3 location
df = spark.catalog.createTable("data1", "s3n://XXXX-Bucket/data1", schema=schema)
Next, the table created from Spark as above (in this case data1) will appear in Hive.
In addition, from another Hive engine you can link to this data in S3 with CREATE EXTERNAL TABLE, using the same schema as created in Spark:
CREATE EXTERNAL TABLE data1 (key INT, value STRING) STORED AS PARQUET LOCATION 's3n://XXXX-Bucket/data1';
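To sanity-check the link from that other engine programmatically, something like PyHive works (an assumption on my part; any Hive client will do, and the hostname is a placeholder):

from pyhive import hive

# connect to HiveServer2 and read back the external table by name
conn = hive.connect(host="hive-server-hostname", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT * FROM data1 LIMIT 10")
print(cursor.fetchall())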
Upvotes: 5
Reputation: 49
You don't need EMR for this. Just fire up Athena and create a table that reads the data in Parquet format. This is much cheaper than running an EMR cluster, and there is no cluster to maintain. You can also query this data in real time via Athena's JDBC driver.
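A minimal sketch of that approach with boto3 (the region, bucket names, and results location are placeholders; Athena's DDL is the same Hive-style CREATE EXTERNAL TABLE shown in the other answer):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# register the S3 Parquet data in Athena's catalog; query results are
# written to a scratch S3 location that Athena requires
athena.start_query_execution(
    QueryString=(
        "CREATE EXTERNAL TABLE IF NOT EXISTS foo (key INT, value STRING) "
        "STORED AS PARQUET LOCATION 's3://bucket/key/prefix/foo/parquet'"
    ),
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)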
Upvotes: 0
Reputation: 2188
The solution was to register the S3 Parquet data as an external table.
sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")
I haven't figured out how to save a file to S3 and register it as an external table all in one shot, but createExternalTable doesn't add too much overhead.
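A minimal sketch of the resulting two-step workflow, assuming df holds the data to persist:

# write the Parquet data to S3, then register that location as an
# external table so the metadata survives cluster teardown
df.write.parquet("s3://bucket/key/prefix/foo/parquet")
sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")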
Upvotes: 5