user5228393


Main purpose of the MetaStore in Hive?

I am a little confused about the purpose of the MetaStore. When you create a table in Hive:

CREATE TABLE <table_name> (column1 data_type, column2 data_type);
LOAD DATA INPATH '<HDFS_file_location>' INTO TABLE <table_name>;

As I understand it, this command takes the contents of the file in HDFS, creates metadata for it, and stores that metadata in the MetaStore (including column types, column names, the location of the data in HDFS, etc.). It doesn't actually move the data from HDFS into Hive.

But what is the purpose of storing this MetaData?

When I connect to Hive using Spark SQL, for example, the MetaStore doesn't contain the actual data in HDFS, just the metadata. So is the MetaStore simply used by Hive for the parsing and compiling steps of a HiveQL query and to create the MapReduce jobs?

Upvotes: 2

Views: 1819

Answers (2)

OneCricketeer

Reputation: 191711

Hive performs schema-on-read operations, which means that for the data to be processed in a structured manner (i.e. as a table-like object), the layout of that data needs to be described in a relational structure.

"takes the contents of the file in HDFS and creates a MetaData form of it"

As far as I know, no files are actually read when you create a table.

SparkSQL connects to the metastore directly. Spark and HiveServer each have their own query parser; parsing is not part of the metastore. MapReduce/Tez/Spark jobs are also not handled by the metastore. It's just a relational database. If it is backed by MySQL, Postgres, or Oracle, you can easily connect to it and inspect the contents. By default, both Hive and Spark use an embedded Derby database.
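
As a rough illustration of the point above, the metadata can be viewed either through Hive itself or by querying the backing database. This is a sketch only: `my_table` is a hypothetical table name, and the `TBLS`, `DBS`, and `SDS` names reflect the metastore schema as commonly deployed, which may differ between Hive versions.

```sql
-- Run in Hive or Spark SQL: prints columns, location, SerDe, and table type,
-- all of which are served from the metastore rather than read from data files.
DESCRIBE FORMATTED my_table;

-- Run directly against the metastore database (e.g. MySQL), not in Hive.
-- Assumed schema: TBLS holds tables, DBS databases, SDS storage descriptors.
-- On Postgres the identifiers may need double quotes.
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE, s.LOCATION
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID;
```

Neither statement touches the files in HDFS; both only read what the metastore has recorded about the table.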

Upvotes: 1

leftjoin

Reputation: 38290

The metastore stores the schema (table definitions including location in HDFS, SerDe, columns, comments, types, partition definitions, views, access permissions, etc.) and statistics. There is no such operation as moving data from HDFS to Hive, because Hive table data is stored in HDFS (or another compatible filesystem such as S3). You can define a new table, or even several tables, on top of some location in HDFS and put files in it. You can also change an existing table's or partition's location; all of this information is stored in the metastore, so Hive knows how to access the data. A table is a logical object defined in the metastore, and the data itself is just files in some location in HDFS.
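
A minimal sketch of what this looks like in HiveQL, assuming tab-delimited text files already sit under a hypothetical HDFS directory `/data/web_logs`. The table names, columns, and the `namenode:8020` address are all made up for illustration.

```sql
-- Define a table over files that already exist in HDFS; no data is copied.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  ip  STRING,
  ts  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/web_logs';

-- A second table can describe the same files with a different schema.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_raw (line STRING)
STORED AS TEXTFILE
LOCATION '/data/web_logs';

-- Repoint the table; only the metastore entry changes, the files stay put.
ALTER TABLE web_logs SET LOCATION 'hdfs://namenode:8020/data/web_logs_v2';
```

Each statement only writes or updates rows in the metastore. The files are read later, when a query runs against the table.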

See also this answer about the Hive query execution flow (high level): https://stackoverflow.com/a/45587873/2700344

Upvotes: 1
