Gabriel Avellaneda

Reputation: 742

How to create an EXTERNAL Spark table from data in HDFS

I have loaded a parquet table from HDFS into a DataFrame:

val df = spark.read.parquet("hdfs://user/zeppelin/my_table")

I now want to expose this table to Spark SQL, but it must be a persistent table because I want to access it from a JDBC connection or from other Spark sessions.

A quick way would be to call the df.write.saveAsTable method, but in that case it materializes the contents of the DataFrame and creates a pointer to the data in the Hive metastore, producing another copy of the data in HDFS.

I don't want to have two copies of the same data, so I would like to create something like an external table that points to the existing data.
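For illustration, this is the kind of call I am trying to avoid (a minimal sketch; it writes a second, managed copy of the data under the warehouse directory):

// Registers a managed table in the metastore and writes a full copy of the
// DataFrame to the warehouse directory -- exactly what I want to avoid.
df.write.saveAsTable("my_table")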

Upvotes: 5

Views: 17622

Answers (1)

Gabriel Avellaneda

Reputation: 742

To create a Spark External table you must specify the "path" option of the DataFrameWriter. Something like this:

df.write.
  option("path","hdfs://user/zeppelin/my_mytable").
  saveAsTable("my_table")

The problem, though, is that it will empty your HDFS path hdfs://user/zeppelin/my_mytable, eliminating your existing files, and will fail with an org.apache.spark.SparkException: Job aborted. This looks like a bug in the Spark API...

Anyway, the workaround (tested in Spark 2.3) is to create the external table from a Spark DDL statement instead. If your table has many columns, writing the DDL by hand can be a hassle. Fortunately, starting from Spark 2.0, you can run SHOW CREATE TABLE to let Spark do the hard work. The catch is that you can only run SHOW CREATE TABLE on a persistent table.

If the table is pretty big, I recommend getting a sample of it, persisting that sample to another location, and then getting the DDL. Something like this:

// Create a sample of the table 
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
df.limit(1).write.
    option("path", "/user/zeppelin/my_table_tmp").
    saveAsTable("my_table_tmp")

// Now get the DDL, do not truncate output
spark.sql("SHOW CREATE TABLE my_table_tmp").show(1, false)

You are going to get a DDL like:

CREATE TABLE `my_table_tmp` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
  `serialization.format` '1',
  path 'hdfs:///user/zeppelin/my_table_tmp')

You would then change this to use the original table name and the path to the original data. You can now run the following to create the Spark external table pointing to your existing HDFS data:

spark.sql("""
  CREATE TABLE `my_table` (`ID` INT, `Descr` STRING)
  USING parquet
  OPTIONS (
    `serialization.format` '1',
    path 'hdfs:///user/zeppelin/my_table')""")

Upvotes: 3
