Reputation: 101
alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>)
I altered my table with the query above, but when I check the data afterwards I get NULLs for both of the new columns. The data is not showing up.
Upvotes: 10
Views: 36644
Reputation: 111
This solution only works if your data is partitioned and you know the location of the latest partition. In that case, instead of running a recover-partitions or repair command, which is a costly operation, you can do something like the following.
Here is some Scala code for reference:
import org.apache.spark.sql.SparkSession

def updateMetastoreColumns(spark: SparkSession, partitionedTablePath: String, toUpdateTableName: String): Unit = {
  // Fetch all column names and their data types from the latest partition
  val partitionedTable = spark.read.orc(partitionedTablePath)
  val partitionedTableColumns = partitionedTable.columns zip partitionedTable.schema.map(_.dataType.catalogString)
  // Fetch all column names and their data types from the current table
  val toUpdateTable = spark.read.table(toUpdateTableName)
  val toUpdateTableColumns = toUpdateTable.columns zip toUpdateTable.schema.map(_.dataType.catalogString)
  // Find the columns that exist in the newer partition but not in the table
  val diffColumns = partitionedTableColumns.diff(toUpdateTableColumns)
  // Update the metastore with the new column info, one column at a time
  diffColumns.foreach { case (name, dataType) =>
    spark.sql(s"ALTER TABLE $toUpdateTableName ADD COLUMNS ($name $dataType)")
  }
}
This lets you dynamically find columns that were added in a newer partition and update the metastore with them on the fly.
Upvotes: 1
Reputation: 364
As others have noted, CASCADE will change the metadata for all partitions. Without CASCADE, if you want to change old partitions to include the new columns, you'll need to DROP the old partitions first and then fill them. INSERT OVERWRITE without the DROP won't work, because the partition keeps its old metadata instead of picking up the new default metadata.
Let's say you have already run alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>) without CASCADE by accident, and then you INSERT OVERWRITE an old partition without dropping it first. The data will be stored in the underlying files, but if you query the table from Hive for that partition, the new columns won't show, because the partition metadata wasn't updated. This can be fixed without rerunning the INSERT OVERWRITE, using the following steps:
1. Run SHOW CREATE TABLE dbname.tblname and copy all the column definitions that existed before adding the new columns
2. ALTER TABLE dbname.tblname REPLACE COLUMNS ({paste in col defs besides columns to add here}) CASCADE
3. ALTER TABLE dbname.tblname ADD COLUMNS (newcol1 int COMMENT "new col") CASCADE
As an example of steps 2-3:
DROP TABLE IF EXISTS junk.testcascade ;
CREATE TABLE junk.testcascade (
startcol INT
)
partitioned by (d int)
stored as parquet
;
INSERT INTO TABLE junk.testcascade PARTITION(d=1)
VALUES
(1),
(2)
;
INSERT INTO TABLE junk.testcascade PARTITION(d=2)
VALUES
(1),
(2)
;
SELECT * FROM junk.testcascade ;
+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.d  |
+-----------------------+----------------+--+
| 1                     | 1              |
| 2                     | 1              |
| 1                     | 2              |
| 2                     | 2              |
+-----------------------+----------------+--+
--no cascade! oops
ALTER TABLE junk.testcascade ADD COLUMNS( testcol1 int, testcol2 int) ;
INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=3)
VALUES
(1,1,1),
(2,1,1)
;
INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=2)
VALUES
(1,1,1),
(2,1,1)
;
--okay! because we created this partition after altering the metadata
select * FROM junk.testcascade where d=3;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.testcol1  | testcascade.testcol2  | testcascade.d  |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1                     | 1                     | 1                     | 3              |
| 2                     | 1                     | 1                     | 3              |
+-----------------------+-----------------------+-----------------------+----------------+--+
--not okay even though we inserted =( because the partition metadata isn't changed
select * FROM junk.testcascade where d=2;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.testcol1  | testcascade.testcol2  | testcascade.d  |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1                     | NULL                  | NULL                  | 2              |
| 2                     | NULL                  | NULL                  | 2              |
+-----------------------+-----------------------+-----------------------+----------------+--+
--cut back to original columns
ALTER TABLE junk.testcascade REPLACE COLUMNS( startcol int) CASCADE;
--add
ALTER table junk.testcascade ADD COLUMNS( testcol1 int, testcol2 int) CASCADE;
--it works!
select * FROM junk.testcascade where d=2;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol  | testcascade.testcol1  | testcascade.testcol2  | testcascade.d  |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1                     | 1                     | 1                     | 2              |
| 2                     | 1                     | 1                     | 2              |
+-----------------------+-----------------------+-----------------------+----------------+--+
Upvotes: 8
Reputation: 151
CASCADE is the solution.
Query:
ALTER TABLE dbname.table_name ADD columns (column1 string,column2 string) CASCADE;
This changes the columns of a table's metadata and cascades the same change to all the partition metadata.
RESTRICT is the default, limiting the column change to the table metadata only.
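One way to see the difference is to compare the table-level schema with the schema recorded for an existing partition. The following is only a sketch: the partition column `d` and its value are assumptions made for illustration, so substitute your own partition spec.

```sql
-- Add the columns and push the change down to every existing partition.
ALTER TABLE dbname.table_name ADD COLUMNS (column1 string, column2 string) CASCADE;

-- Table-level schema: lists column1 and column2 with or without CASCADE.
DESCRIBE dbname.table_name;

-- Partition-level schema for an existing partition.
-- `d=1` is a hypothetical partition spec, not taken from the original table.
-- With CASCADE this should also list column1 and column2;
-- with the default RESTRICT the partition keeps its old column list.
DESCRIBE dbname.table_name PARTITION (d=1);
```

If the partition-level output is missing the new columns, queries against that partition return NULL for them even when the underlying files contain the data.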
Upvotes: 11
Reputation: 38335
To add columns to a partitioned table, you need to recreate the partitions. Suppose the table is external and the data files already contain the new columns. Do the following:
1. Alter the table to add the columns.
2. Recreate the partitions: for each partition, drop it and then create it again. A newly created partition inherits the table schema.
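The drop-and-recreate steps above might look roughly like this for a single partition. This is a sketch under assumptions: the partition column `dt`, its value, and the location path are hypothetical, and it assumes an external table whose data files are left in place when the partition is dropped.

```sql
-- Step 1: add the new columns at the table level.
ALTER TABLE abc ADD COLUMNS (stats1 map<string,string>, stats2 map<string,string>);

-- Step 2: drop the partition and add it back so it picks up the new table schema.
-- `dt` and the LOCATION path are placeholders; use your own partition spec and path.
ALTER TABLE abc DROP IF EXISTS PARTITION (dt='2016-01-01');
ALTER TABLE abc ADD PARTITION (dt='2016-01-01') LOCATION '/path/to/abc/dt=2016-01-01';
```

Repeat step 2 for each partition that needs to expose the new columns.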
Alternatively, you can drop the table, create it again, and then either create all the partitions or restore them by simply running the MSCK REPAIR TABLE abc command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is: ALTER TABLE table_name RECOVER PARTITIONS.
See manual here: RECOVER PARTITIONS
Also, in Hive 1.1.0 and later you can use the CASCADE option of ALTER TABLE ADD|REPLACE COLUMNS. See manual here: ADD COLUMN
These suggestions work for external tables.
Upvotes: 5