Veeru Chow
Veeru Chow

Reputation: 101

how to add columns to existing hive partitioned table?

alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>)

i have altered my table with above query. But after while checking the data i got NULL's for the both extra columns. I'm not getting data.

screenshot

Upvotes: 10

Views: 36644

Answers (4)

Eapen Jose
Eapen Jose

Reputation: 111

This solution only works if your data is partitioned and you know the location of the latest partition. In this case instead of doing a recover partition or a repair which is a costly operation, you can do something like:

  • Read the partitioned table and get the schema details
  • Read the table you want to update
  • Now find which all columns are different and do a alter table for each

Posting a scala code for reference:

def updateMetastoreColumns(spark: SparkSession, partitionedTablePath: String, toUpdateTableName: String): Unit = {
    //fetch all column names along with their corresponding datatypes from latest partition
    val partitionedTable = spark.read.orc(partitionedTablePath)
    val partitionedTableColumns = partitionedTable.columns zip partitionedTable.schema.map(_.dataType.catalogString)

    //fetch all column names along with their corresponding datatypes from currentTable
    val toUpdateTable = spark.read.table(toUpdateTableName)
    val toUpdateTableColumns = toUpdateTable.columns zip toUpdateTable.schema.map(_.dataType.catalogString)

    //check if new columns are present in newer partition
    val diffColumns = partitionedTableColumns.diff(toUpdateTableColumns)

    //update the metastore with new column info
    diffColumns.foreach {column: (String, String) => {
      spark.sql(s"ALTER TABLE ${toUpdateTableName} ADD COLUMNS (${column._1} ${column._2})")
    }}
  }

This will help you dynamically find latest columns which are added to newer partition and update it to your metastore on the fly.

Upvotes: 1

DVL
DVL

Reputation: 364

As others have noted CASCADE will change the metadata for all partitions. Without CASCADE, if you want to change old partitions to include the new columns, you'll need to DROP the old partitions first and then fill them, INSERT OVERWRITE without the DROP won't work, because the metadata won't update to the new default metadata.

Let's say you have already run alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>) without CASCADE by accident and then you INSERT OVERWRITE an old partition without DROPPING first. The data will be stored in the underlying files, but if you query that table from hive for that partition, it won't show because the metadata wasn't updated. This can be fixed without having to rerun the insert overwrite using the following:

  1. Run SHOW CREATE TABLE dbname.tblname and copy all the column definitions that existed before adding new columns
  2. Run ALTER TABLE dbname.tblname REPLACE COLUMNS ({paste in col defs besides columns to add here}) CASCADE
  3. Run ALTER TABLE dbname.tblname ADD COLUMNS (newcol1 int COMMENT "new col") CASCADE
  4. be happy that the metadata has been changed for all partitions =)

As an example of steps 2-3:

DROP TABLE IF EXISTS  junk.testcascade ;
CREATE TABLE junk.testcascade (
startcol INT
)
partitioned by (d int)
stored as parquet
;
INSERT INTO TABLE junk.testcascade PARTITION(d=1)
VALUES
    (1),
    (2)
;

INSERT INTO TABLE junk.testcascade PARTITION(d=2)
VALUES
    (1),
    (2)
;

SELECT * FROM junk.testcascade ;
+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.d  |
+-----------------------+----------------+--+
| 1                     | 1              |
| 2                     | 1              |
| 1                     | 2              |
| 2                     | 2              |
+-----------------------+----------------+--+

 --no cascade! opps
ALTER TABLE junk.testcascade ADD COLUMNS( testcol1 int, testcol2 int) ;

INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=3)
VALUES
    (1,1,1),
    (2,1,1)
;

INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=2)
VALUES
    (1,1,1),
    (2,1,1)
;

--okay! because we created this table after altering the metadata
select * FROM junk.testcascade where d=3;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.testcol1  | testcascade.testcol2  | testcascade.d  |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1                     | 1                     | 1                     | 3              |
| 2                     | 1                     | 1                     | 3              |
+-----------------------+-----------------------+-----------------------+----------------+--+

--not okay even tho we inserted =( because the metadata isnt changed
select * FROM junk.testcascade where d=2;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.testcol1  | testcascade.testcol2  | testcascade.d  |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1                     | NULL                  | NULL                  | 2              |
| 2                     | NULL                  | NULL                  | 2              |
+-----------------------+-----------------------+-----------------------+----------------+--+

--cut back to original columns
ALTER TABLE junk.testcascade REPLACE COLUMNS( startcol int) CASCADE;

--add
ALTER table junk.testcascade ADD COLUMNS( testcol1 int, testcol2 int) CASCADE;

--it works!
select * FROM junk.testcascade where d=2; 
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.testcol1  | testcascade.testcol2  | testcascade.d  |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1                     | 1                     | 1                     | 2              |
| 2                     | 1                     | 1                     | 2              |
+-----------------------+-----------------------+-----------------------+----------------+--+

Upvotes: 8

Pravat Sutar
Pravat Sutar

Reputation: 151

CASCADE is the solution.

Query:

ALTER TABLE dbname.table_name ADD columns (column1 string,column2 string) CASCADE; 

This changes the columns of a table's metadata and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column change only to table metadata.

Upvotes: 11

leftjoin
leftjoin

Reputation: 38335

To add columns into partitioned table you need to recreate partitions. Suppose the table is external and the datafiles already contain new columns, do the following: 1. Alter table add columns... 2. Recreate partitions. For each partitions do Drop then create. Newly created partition schema will inherit the table schema.

Alternatively you can drop the table then create table and create all partitions or restore them simply running MSCK REPAIR TABLE abc command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is: ALTER TABLE table_name RECOVER PARTITIONS. See manual here: RECOVER PARTITIONS

Also in Hive 1.1.0 and later you can use CASCADE option of ALTER TABLE ADD|REPLACE COLUMNS. See manual here: ADD COLUMN

These suggestions work for external tables.

Upvotes: 5

Related Questions