Chris Hansen
Chris Hansen

Reputation: 8685

updating a table using hive

Right now I run the following Hive query

CREATE TABLE dwo_analysis.exp_shown AS
SELECT

MIN(sc.date_time) as first_shown_time,
SUBSTR(sc.post_evar12,1,24) as guid,
sc.post_evar238 as experiment_name,
sc.post_evar239 as variant_name

FROM test

WHERE report_suite='adbemmarvelweb.prod'
AND date >= DATE_SUB(CURRENT_DATE,90) AND date < DATE_SUB(CURRENT_DATE, 2)
AND post_prop5 = 'experiment:standard:authenticated:shown'
AND post_evar238 NOT LIKE 'control%'
AND post_evar238 <> ''
AND post_evar239 <> ''

The table test is large. I would like to optimize this query by running it once, and every other time updating the table by getting the last 2 days of data and adding it to the table.

so basically run the above query once and every time run it again but with the condition

WHERE click_date >= DATE_SUB(CURRENT_DATE, 2) AND click_date < DATE_SUB(CURRENT_DATE)

How do I update the table using hive to populate the the rows as mentioned in the condition above?

Upvotes: 1

Views: 448

Answers (1)

OneCricketeer
OneCricketeer

Reputation: 192023

First, your queries would be quicker if the Hive table were partitioned based on date. Your create table statement isn't inserting into any partitions, therefore I suspect your table is not partitioned. It would also be quicker if the source data were Parquet/ORC

In any case, you can overwrite the table for a date range like so

INSERT OVERWRITE TABLE dwo_analysis.exp_shown
    SELECT * FROM test 
WHERE click_date 
  BETWEEN DATE_SUB(CURRENT_DATE, 2) AND CURRENT_DATE;

Upvotes: 1

Related Questions