kiran

Reputation: 83

How to update multiple columns in a Hive table that is bucketed and stored in ORC format

I can update a single column, but I can't update multiple columns that are referenced from another table.

I enabled all the ACID properties needed to support table updates in Hive 1.2.1.

I have two tables,

Table1:

schema:

create table table1(emp_id int,emp_name string, etime timestamp);

data:

emp_id | emp_name | etime
1      | david    | 2016-01-08 12:27:30
2      | john     | 2016-01-08 12:27:45

Table 2

schema :

create table table2(emp_id int, emp_name string, etime timestamp) clustered by (emp_id) into 4 buckets stored as orc TBLPROPERTIES('transactional'='true');

data:

emp_id | emp_name | etime
1      | davi     | 2016-01-08 12:02:30
2      | johny    | 2016-01-08 11:20:45
3      | kiran    | 2016-01-08 11:01:36

Now I want to update all columns in table2 based on the data in table1, matching on emp_id of both tables. I want the output to be as follows:

Table2:

emp_id | emp_name | etime
1      | david    | 2016-01-08 12:27:30
2      | john     | 2016-01-08 12:27:45
3      | kiran    | 2016-01-08 11:01:36

Upvotes: 1

Views: 1814

Answers (2)

mgm_data

Reputation: 57

Updating all the columns doesn't make sense in Hive; Hive is different from an RDBMS. If the table above had a billion rows and 200 columns, the process would be very slow and an inefficient use of Hive. You could use HBase for that and pull the data from HBase into Hive. Within Hive, however, the approach I use for this scenario is to add an is_valid (boolean) field to the second table.

Example:

Table 2

schema :

create table table2(emp_id int, emp_name string, etime timestamp, is_valid boolean) clustered by (emp_id) into 4 buckets stored as orc TBLPROPERTIES('transactional'='true');

data:

emp_id | emp_name | etime               | is_valid
1      | davi     | 2016-01-08 12:02:30 | true
2      | johny    | 2016-01-08 11:20:45 | true
3      | kiran    | 2016-01-08 11:01:36 | true

Use a left semi join to identify your changed records (I would suggest using a view here):

create view update_emp_id_vw as select t1.emp_id from table1 t1 left semi join table2 t2 on t1.emp_id = t2.emp_id;

update table2 set is_valid = false where emp_id in (select emp_id from update_emp_id_vw);

insert into table2 select emp_id, emp_name, etime, true from table1 where emp_id in (select emp_id from update_emp_id_vw);

Table2:

emp_id | emp_name | etime               | is_valid
1      | davi     | 2016-01-08 12:02:30 | false
2      | johny    | 2016-01-08 11:20:45 | false
3      | kiran    | 2016-01-08 11:01:36 | true
1      | david    | 2016-01-08 12:27:30 | true
2      | john     | 2016-01-08 12:27:45 | true

Later on you can query only the rows where is_valid = true. If you do not want the history in table2, you can delete the rows where is_valid = false.
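Concretely, those two follow-up operations might look like this (a sketch; the delete needs the same ACID/transactional configuration as the updates above):

```sql
-- Read only the current version of each row
select emp_id, emp_name, etime
from table2
where is_valid = true;

-- Optionally discard the superseded history rows
delete from table2
where is_valid = false;
```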

There are several other approaches you can implement. This one is an option if you want to use the table as a slowly changing dimension. You can add a current_timestamp field to table2 if you want to track the history. (More info can be found in this blog: https://hortonworks.com/hadoop-tutorial/using-hive-acid-transactions-insert-update-delete-data/)
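For the history-tracking variant, a hypothetical table2 layout with an audit column could look like this (the column name updated_at is my choice for illustration, not from the original answer):

```sql
create table table2(
  emp_id int,
  emp_name string,
  etime timestamp,
  is_valid boolean,
  updated_at timestamp)  -- populate with current_timestamp() on every insert
clustered by (emp_id) into 4 buckets
stored as orc
TBLPROPERTIES('transactional'='true');
```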

Hope this helps...

Upvotes: 0

JassyJov

Reputation: 204

Try this (Hive's UPDATE cannot join to another table, but MERGE, available from Hive 2.2 on ACID tables, can update both columns from table1 in one statement):

merge into table2 t2 using table1 t1 on t1.emp_id = t2.emp_id when matched then update set emp_name = t1.emp_name, etime = t1.etime;

Upvotes: 0
