Reputation: 11
We just went ahead with a deployment for one of our Hive based table. We renamed our column risk_old to risk_new (renamed). The table is period partitioned. However post deployment, we saw a strange behaviour that all data for that renamed column for all prior periods have been deleted/removed by null. Only newer period data is having values. How is this possible? How can we get back the data.
Sample data prior deployment:
id | risk_old | period |
---|---|---|
1234 | high | 20230927 |
2345 | low | 20230927 |
1234 | high | 20230928 |
2345 | low | 20230928 |
Steps for deployment:
Post deployment data:
id | risk_new | period |
---|---|---|
1234 | 20230927 | |
2345 | 20230927 | |
1234 | 20230928 | |
2345 | 20230928 | |
1234 | high | 20230929 |
2345 | low | 20230929 |
We are looking on how the new column risk_new can also populate all periods data? We cannot revert back changes since all downstream apps have also changed their code renaming columns to fetch data from our table. One catch in UAT, if we try altering back table name from risk_new back to risk_old, it is showing older data but not the newer one.
Any leads in how to retain the name column name with all periods of data? Thanks in advance
Upvotes: 1
Views: 159
Reputation: 7407
Renaming columns in hive with tables stored as parquet doesnt work properly. Because the schema info is stored in a file and the schema info changes when you fire alter table statement but underlying data doesnt change. This shows odd/old values for that column.
Clean solution would be -
Create a table with new definition from old table and rename columns while selecting
create table newtable as select id, risk_old as risk_new, period from oldtable ;
Drop old table drop table oldtable
.
Alter new table and rename to oldtable. alter table newtable rename to oldtable;
Upvotes: 0