Boris Mitioglov

Reputation: 1182

How to create managed hive table with specified location through Spark SQL?

I want to create a managed table with a location on AWS S3 through Spark SQL, but if I specify the location it creates an EXTERNAL table even though I didn't specify that keyword.

CREATE TABLE IF NOT EXISTS database.tableOnS3(name string)
LOCATION 's3://mybucket/';

Why is the EXTERNAL keyword implied here?

If I execute this query in the Hive console it creates a managed table, so how do I do the same in Spark?

Upvotes: 4

Views: 10565

Answers (3)

sasikala d

Reputation: 41

Change the tableType to MANAGED once the external table has been created.

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// yourTableName and yourDatabaseName are placeholders for the actual table and database names
val identifier = TableIdentifier(yourTableName, Some(yourDatabaseName))
spark.sessionState.catalog.alterTable(
  spark.sessionState.catalog.getTableMetadata(identifier).copy(tableType = CatalogTableType.MANAGED))
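
To verify (a minimal sketch reusing the identifier above; yourTableName and yourDatabaseName are still placeholders), re-read the metadata and check the type:

// After the alterTable call, the table type should now be MANAGED.
val newType = spark.sessionState.catalog.getTableMetadata(identifier).tableType
assert(newType == CatalogTableType.MANAGED)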

Upvotes: 4

kfkhalili

Reputation: 1035

Looking at the documentation on the Hive Confluence wiki (emphasis my own):

This document lists some of the differences between the two but the fundamental difference is that Hive assumes that it owns the data for managed tables. That means that the data, its properties and data layout will and can only be changed via Hive command. The data still lives in a normal file system and nothing is stopping you from changing it without telling Hive about it. If you do though it violates invariants and expectations of Hive and you might see undefined behavior.

So essentially, EXTERNAL is assumed because you are setting the location, and hence Hive doesn't own or control the data.

The way to do this, i.e. create a MANAGED table with a custom location, is to first create an EXTERNAL table with the location set (this cannot be avoided for the reasons mentioned above) and then modify the table metadata to MANAGED. Note that, as the documentation states, this can lead to undefined behavior.

-- Following your example, this Hive statement creates an EXTERNAL table
CREATE TABLE IF NOT EXISTS database.tableOnS3(name string) LOCATION 's3://mybucket/';

-- Change the table type from within Hive, from EXTERNAL to MANAGED
ALTER TABLE database.tableOnS3 SET TBLPROPERTIES('EXTERNAL'='FALSE');

// Or from within spark
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// Get the session catalog
val catalog = spark.sessionState.catalog

// Identify the table in question
val identifier = TableIdentifier("tableOnS3", Some("database"))

// Get its current metadata
val tableMetadata = catalog.getTableMetadata(identifier)

// Clone the metadata while changing the tableType to MANAGED
val alteredMetadata = tableMetadata.copy(tableType = CatalogTableType.MANAGED)

// Alter the table using the new metadata
catalog.alterTable(alteredMetadata)

And now you have a MANAGED table with the location manually set.
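
If you want to confirm the change (a quick sketch; the exact DESCRIBE output varies between Spark versions), you can check the table type from Spark SQL:

// The detailed table information should now report the type as MANAGED
// instead of EXTERNAL.
spark.sql("DESCRIBE EXTENDED database.tableOnS3").show(100, truncate = false)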

Upvotes: 2

Ram Ghadiyaram

Reputation: 29185

See the docs. Hive fundamentally knows two different types of tables:

Managed (Internal)
External


Managed tables: A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /user/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.

Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.

External tables: An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.

Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
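
For illustration only (a hedged sketch, not part of the quoted docs; the table names demo_managed and demo_external, the partition column dt, and the bucket path are made up), the difference plays out like this in Spark SQL:

// A table created without LOCATION is MANAGED: its data lives under
// spark.sql.warehouse.dir and DROP TABLE deletes both data and metadata.
spark.sql("CREATE TABLE IF NOT EXISTS database.demo_managed (name STRING)")

// A table created with an explicit LOCATION is treated as EXTERNAL:
// DROP TABLE removes only the metadata, the files in S3 stay.
spark.sql(
  "CREATE TABLE IF NOT EXISTS database.demo_external (name STRING) " +
  "PARTITIONED BY (dt STRING) LOCATION 's3://mybucket/demo_external/'")

// If partitions are added directly on S3, outside of Hive/Spark,
// refresh the metastore so the new partitions become visible.
spark.sql("MSCK REPAIR TABLE database.demo_external")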

Conclusion:

Since you are using an S3 location, which is external to the warehouse directory, the table shows up as EXTERNAL.

Further, if you want to understand how the code works, see CreateTableLikeCommand: the line val tblType = if (location.isEmpty) CatalogTableType.MANAGED else CatalogTableType.EXTERNAL is where it dynamically decides the table type...

/**
 * A command to create a table with the same definition of the given existing table.
 * In the target table definition, the table comment is always empty but the column comments
 * are identical to the ones defined in the source table.
 *
 * The CatalogTable attributes copied from the source table are storage(inputFormat, outputFormat,
 * serde, compressed, properties), schema, provider, partitionColumnNames, bucketSpec.
 *
 * The syntax of using this command in SQL is:
 * {{{
 *   CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
 *   LIKE [other_db_name.]existing_table_name [locationSpec]
 * }}}
 */
case class CreateTableLikeCommand(
    targetTable: TableIdentifier,
    sourceTable: TableIdentifier,
    location: Option[String],
    ifNotExists: Boolean) extends RunnableCommand {

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val catalog = sparkSession.sessionState.catalog
    val sourceTableDesc = catalog.getTempViewOrPermanentTableMetadata(sourceTable)

    val newProvider = if (sourceTableDesc.tableType == CatalogTableType.VIEW) {
      Some(sparkSession.sessionState.conf.defaultDataSourceName)
    } else {
      sourceTableDesc.provider
    }

    // If the location is specified, we create an external table internally.
    // Otherwise create a managed table.
    val tblType = if (location.isEmpty) CatalogTableType.MANAGED else CatalogTableType.EXTERNAL

    val newTableDesc =
      CatalogTable(
        identifier = targetTable,
        tableType = tblType,
        storage = sourceTableDesc.storage.copy(
          locationUri = location.map(CatalogUtils.stringToURI(_))),
        schema = sourceTableDesc.schema,
        provider = newProvider,
        partitionColumnNames = sourceTableDesc.partitionColumnNames,
        bucketSpec = sourceTableDesc.bucketSpec)

    catalog.createTable(newTableDesc, ifNotExists)
    Seq.empty[Row]
  }
}

Update: "If I execute this query in hive console it's creating managed table, so how to do the same in spark?"

Hope you are using the same location (not a different VPC) where Hive and Spark co-exist. If so, then set

spark.sql.warehouse.dir=hdfs:///... to the S3 location

using the Spark conf. You may also need to set the access key and secret key credentials on the Spark config object when creating the SparkSession.
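
A minimal sketch of what that could look like (assuming the Hadoop s3a connector; the warehouse path and credential placeholders are illustrative, not from the original answer):

import org.apache.spark.sql.SparkSession

// Point the Spark SQL warehouse at S3 and pass credentials through the
// Hadoop s3a configuration; replace the placeholders with real values.
val spark = SparkSession.builder()
  .appName("managed-table-on-s3")
  .config("spark.sql.warehouse.dir", "s3a://mybucket/warehouse/")
  .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
  .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
  .enableHiveSupport()
  .getOrCreate()

// With no explicit LOCATION, the table now lands under the S3 warehouse
// directory and stays MANAGED.
spark.sql("CREATE TABLE IF NOT EXISTS database.tableOnS3 (name string)")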


Upvotes: 3
