Guillaume

Reputation: 2899

My SQL MERGE statement runs for too long

I have this Hive MERGE statement:

MERGE INTO destination dst
USING (
  SELECT

   -- DISTINCT fields
      company
    , contact_id as id
    , ct.cid as cid

     -- other fields
    , email
    , timestamp_utc
    -- there are actually about 6 more 

    -- deduplication
    , ROW_NUMBER() OVER (
         PARTITION BY company
       , ct.cid
       , contact_id
         ORDER BY timestamp_utc DESC
    ) as r

  FROM
    source
  LATERAL VIEW explode(campaign_id) ct AS cid
) src
ON
        dst.company = src.company
    AND dst.campaign_id = src.cid
    AND dst.id = src.id

-- On match: keep latest loaded
WHEN MATCHED
    AND dst.updated_on_utc < src.timestamp_utc
    AND src.r = 1
THEN UPDATE SET
    email =  src.email
  , updated_on_utc = src.timestamp_utc

WHEN NOT MATCHED AND src.r = 1 THEN INSERT VALUES (
    src.id

  , src.email

  , src.timestamp_utc

  , src.license_name
  , src.cid
)
;

It runs for a very long time (30 minutes for 7 GB of Avro-compressed data on disk). I wonder whether there are any SQL-level ways to improve it.

ROW_NUMBER() is here to deduplicate the source table, so that in the MATCHED clause we only select the latest row.

One thing I am not sure of is that Hive says:

SQL Standard requires that an error is raised if the ON clause is such that more than 1 row in source matches a row in target. This check is computationally expensive and may affect the overall runtime of a MERGE statement significantly. hive.merge.cardinality.check=false may be used to disable the check at your own risk. If the check is disabled, but the statement has such a cross join effect, it may lead to data corruption.

I do indeed disable the cardinality check: although the ON clause might match 2 rows in the source, those rows are reduced to 1 by the r = 1 condition later in the MATCHED clause.
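For reference, the setting named in Hive's warning above is disabled per session:

```sql
-- disable the SQL-standard cardinality check at your own risk,
-- i.e. only when the query itself guarantees at most one source
-- row per target row (here via the r = 1 filter)
set hive.merge.cardinality.check=false;
```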

Overall I like this MERGE statement, but it is just too slow, and any help would be appreciated.

Note that the destination table is partitioned. The source table is not, as it is an external table that must be fully merged (thus fully scanned) on every run; in the background, already-merged data files are removed and new files are added before the next run. I am not sure partitioning would help in that case.


Upvotes: 1

Views: 1381

Answers (1)

leftjoin

Reputation: 38335

Option 1: Move the filter where src.r = 1 inside the src subquery and check the merge performance. This will reduce the number of source rows before the merge.
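A minimal sketch of Option 1, reusing the schema from the question (field list abbreviated, remaining columns elided as in the original):

```sql
MERGE INTO destination dst
USING (
  SELECT * FROM (
    SELECT
        company
      , contact_id AS id
      , ct.cid AS cid
      , email
      , timestamp_utc
      -- plus the remaining fields
      , ROW_NUMBER() OVER (
          PARTITION BY company, ct.cid, contact_id
          ORDER BY timestamp_utc DESC
        ) AS r
    FROM source
    LATERAL VIEW explode(campaign_id) ct AS cid
  ) t
  WHERE r = 1  -- dedupe before the merge, not inside the WHEN clauses
) src
ON      dst.company = src.company
    AND dst.campaign_id = src.cid
    AND dst.id = src.id
WHEN MATCHED AND dst.updated_on_utc < src.timestamp_utc
  THEN UPDATE SET email = src.email, updated_on_utc = src.timestamp_utc
WHEN NOT MATCHED
  THEN INSERT VALUES (src.id, src.email, src.timestamp_utc, src.license_name, src.cid);
```

With the deduplication done inside the subquery, each target row can match at most one source row, so the AND src.r = 1 conditions (and the need to disable the cardinality check) go away.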

The other two options do not require ACID mode; they do a full rewrite of the target table.

Option 2: Rewrite using UNION ALL + row_number (this should be the fastest one):

insert overwrite table destination 
select 
  company
, id
, cid
, email
, timestamp_utc
-- add more fields 
from
(
select --dedupe, select last updated rows using row_number
  s.*
, row_number() over (partition by company, cid, id order by timestamp_utc desc) as rn
from
(
select --union all source and target
  company
, contact_id as id
, ct.cid as cid
, email
, timestamp_utc
-- add more fields 
from source lateral view explode(campaign_id) ct as cid
union all
select 
  company
, id
, campaign_id as cid
, email
, updated_on_utc as timestamp_utc
-- add more fields 
from destination
) s --union all
) s --row_number calculated
where rn = 1 --filter duplicates, keep the latest row per key

If the source contains a lot of duplicates, you can apply the same row_number filtering to the source subquery before the union as well.
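A sketch of that extra pre-filtering step, assuming the same column names as above, so that only the latest source row per key enters the UNION ALL:

```sql
select company, id, cid, email, timestamp_utc
from
(
  select
    company
  , contact_id as id
  , ct.cid as cid
  , email
  , timestamp_utc
  , row_number() over (partition by company, ct.cid, contact_id
                       order by timestamp_utc desc) as rn
  from source
  lateral view explode(campaign_id) ct as cid
) s
where rn = 1 -- deduplicated source, ready for the union with destination
```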

One more approach using full join: https://stackoverflow.com/a/37744071/2700344
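For completeness, a hedged sketch of the full-join variant (column list abbreviated; COALESCE prefers the source value whenever a matching source row exists, so unlike the MERGE it does not compare timestamps, add that comparison if you need strict keep-latest semantics):

```sql
insert overwrite table destination
select
    coalesce(s.company, d.company)              as company
  , coalesce(s.id, d.id)                        as id
  , coalesce(s.cid, d.campaign_id)              as campaign_id
  , coalesce(s.email, d.email)                  as email
  , coalesce(s.timestamp_utc, d.updated_on_utc) as updated_on_utc
from destination d
full join (
  -- exploded (and, if needed, pre-deduplicated) source
  select company, contact_id as id, ct.cid as cid, email, timestamp_utc
  from source
  lateral view explode(campaign_id) ct as cid
) s
  on  d.company = s.company
  and d.campaign_id = s.cid
  and d.id = s.id;
```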

Upvotes: 1
