My SQL MERGE statement runs for too long

Question

I have this Hive MERGE statement:

MERGE INTO destination dst
USING (
  SELECT

   -- DISTINCT fields
      company
    , contact_id as id
    , ct.cid as cid

     -- other fields
    , email
    , timestamp_utc
    -- there are actually about 6 more 

    -- deduplication
    , ROW_NUMBER() OVER (
         PARTITION BY company
       , ct.id
       , contact_id
         ORDER BY timestamp_utc DESC
    ) as r

  FROM
    source
  LATERAL VIEW explode(campaign_id) ct AS cid
) src
ON
        dst.company = src.company
    AND dst.campaign_id = src.cid
    AND dst.id = src.id

-- On match: keep latest loaded
WHEN MATCHED
    AND dst.updated_on_utc < src.timestamp_utc
    AND src.r = 1
THEN UPDATE SET
    email =  src.email
  , updated_on_utc = src.timestamp_utc

WHEN NOT MATCHED AND src.r = 1 THEN INSERT VALUES (
    src.id

  , src.email

  , src.timestamp_utc

  , src.license_name
  , src.cid
)
;

Which runs for a very long time (30 minutes for 7GB of avro compressed data on disk). I wonder if there are any SQL ways to improve it.

ROW_NUMBER() is here to deduplicate the source table, so that in the MATCH clause we only select the earliest row.

One thing I am not sure of, is that hive says:

SQL Standard requires that an error is raised if the ON clause is such that more than 1 row in source matches a row in target. This check is computationally expensive and may affect the overall runtime of a MERGE statement significantly. hive.merge.cardinality.check=false may be used to disable the check at your own risk. If the check is disabled, but the statement has such a cross join effect, it may lead to data corruption.

I do indeed disable the cardinality check, as although the ON statement might give 2 rows in source, those rows are limited to 1 only thanks to the r=1 later in the MATCH clause.

Overall I like this MERGE statement but it is just too slow and any help would be appreciated.

Note that the destination table is partitioned. The source table is not as it is an external table which for every run must be fully merged, so fully scanned (in the background already merged data files are removed and new files are added before next run). Not sure that partitioning would help in that case

What I have done:

play with hdfs/hive/yarn configuration
try with a temporary table (2 steps) instead of a single MERGE, the run time jumped to more than 2 hours.

My SQL MERGE statement runs for too long

Answers (1)

Related Questions