david1977
david1977

Reputation: 769

Is it possible to have multiple hive tables represented within the same HDFS directory structure?

Is it possible to have multiple hive tables represented within the same HDFS directory structure? In other words, is there a way to have multiple hive tables pointing to same/overlapping HDFS paths?

Here is my situation:

I have a table named "mytable", located in hdfs:/tables/mytable

CREATE EXTERNAL TABLE mytable
(
id int,
...
[a whole bunch of columns]
...
PARTITIONED BY (logname STRING)
STORED AS [I-do-not-know-what-just-yet]
LOCATION 'hdfs:/tables/mytable';

So, HDFS will look like:

hdfs:/tables/mytable/logname=tarzan/....
hdfs:/tables/mytable/logname=jane/....
hdfs:/tables/mytable/logname=whoa/....

Is it possible to have a hive table, named "tarzan", located in hdfs:/tables/mytable/logname=tarzan ? Same with hive table "jane", located in hdfs:/tables/mytable/logname=jane, etc.

The tarzan, jane, whoa, etc sub-tables share some columns (timestamp, ip_address, country, user_id, and some others), but there will also be a lot of columns that they do not have in common.

Is there a way to store this data once in HDFS, and use it for multiple tables as I described above? Furthermore, is there a way to store the data in an efficient way, since many of the tables will have columns that are not in common? Would a file format like RCFILE or PARQUET work in this case?

Thanks so much for any hints or help anyone can provide,

Upvotes: 4

Views: 7521

Answers (3)

Parvinder Singh
Parvinder Singh

Reputation: 495

Yes, we can have multiple hive tables with the same underlying HDFS directory.

Example:

  1. Create table emp and load data file file3 into it.

    create table emp (id int, name string, salary int) 
    row format delimited 
    fields terminated by ','  
    -- default location would be used
    
    load data 
    local inpath '/home/parv/testfiles/file3' 
    into table emp; 
    
  2. Create another table mirror. When you will select data from mirror table, it would be as same as of emp table (contents of file3).

    create table mirror (id int, name string, salary int) 
    row format delimited 
    fields terminated by ',' 
    location 'hdfs:///user/hive/warehouse/parv.db/base';
    
  3. Load data into mirror table. When you will select data either from mirror table or emp table, it would return same results (contents of file3 and file4).

    load data 
    local inpath '/home/parv/testfiles/file4' 
    into table mirror;  
    

Conclusion:

  1. Same data files are shared among both tables emp and mirror.
  2. But, strange, the HDFS filesystem only shows data directory for emp table and not for mirror table. However, both the tables are present in hive and so can be queried.

Upvotes: 3

david1977
david1977

Reputation: 769

Answering my own question:

It IS possible to have multiple hive tables represented by the same HDFS directory structure, but for what I am looking to do:

  1. A mytable table partitioned by logname (logname=tarzan, logname=jane, etc...)
  2. A separate table for each logname: A "tarzan" table with only columns used by the tarzan table, and not any other logname, same for the "jane" table, etc
  3. Only represent the data one time in HDFS

A better solution is to have the 1 mytable table, partitioned by logname, AND create views for each logname table, with only the subset of columns needed in each.

Upvotes: 1

brandon.bell
brandon.bell

Reputation: 1411

Yes, you could point multiple tables to the same location on HDFS. However, Hive doesn't support dynamic columns.

Is there a reason you can't just have 3 different tables? This would allow you do have different schemas (columns) for each.

--Brandon

Upvotes: 0

Related Questions