Jan Lender
Jan Lender

Reputation: 58

Azure Data Factory V2: Copy OR Stored Procedure activity for SQL merge

We have number of DB table merge steps in our Azure Data Factory v2 solution. We merge tables in a single instance of Azure SQL Server DB. Source tables and target tables are in different DB schemas. Sources are defined either as a select over single table or as a join of two tables.

My doubt is which one of scenarios described bellow is better from the performance perspective.

Scenario one (per table)

Stored Procedure activity invokes a stored procedure that performs all the work. A Stored Procedure activity in a pipeline invokes that stored procedure. The upserts the target table with all source data. An example of such a stored procedure:

create or alter procedure dwh.fill_lnk_cemvypdet_cemstr2c_table_with_stage_data as
    merge
        dwh.lnk_cemvypdet_cemstr2c as target
    using

        (select
                t.sa_hashkey cemvypdet_hashkey,
                t.sa_timestamp load_date,
                t.sa_source record_source,
                d.sa_hashkey cemstr2c_hashkey
            from
                egje.cemvypdet t
            join
                egje.cemstr2c d
            on
                t.id_mstr = d.id_mstr)
        as source
        on target.cemvypdet_hashkey = source.cemvypdet_hashkey
            and target.cemstr2c_hashkey = source.cemstr2c_hashkey
        when not matched then
            insert(
                cemvypdet_hashkey,
                cemstr2c_hashkey,
                record_source,
                load_date,
                last_seen_date)
            values(
                source.cemvypdet_hashkey,
                source.cemstr2c_hashkey,
                source.record_source,
                source.load_date,
                source.load_date)
        when matched then
            update set last_seen_date = source.load_date;

Scenario two (per row)

A Copy activity declares a stored procedure to invoke in Target tab so that the activity invokes the stored procedure for every single row of the source.

create or alter procedure dwh.fill_lnk_cemvypdet_cemstr2c_subset_table_row_with_stage_data
@lnk_cemvypdet_cemstr2c_subset dwh.lnk_cemvypdet_cemstr2c_subset_type readonly
as
    merge
        dwh.lnk_cemvypdet_cemstr2c_subset as target
    using

    @lnk_cemvypdet_cemstr2c_subset
        as source
        on target.cemvypdet_hashkey = source.cemvypdet_hashkey
            and target.cemstr2c_hashkey = source.cemstr2c_hashkey
        when not matched then
            insert(
                hashkey,
                cemvypdet_hashkey,
                cemstr2c_hashkey,
                record_source,
                load_date,
                last_seen_date)
            values(
                source.hashkey,
                source.cemvypdet_hashkey,
                source.cemstr2c_hashkey,
                source.record_source,
                source.load_date,
                source.load_date)
        when matched then
            update set last_seen_date = source.load_date;

The type @lnk_cemvypdet_cemstr2c_subset is defined as a table type that follows structure of the target table.

Upvotes: 1

Views: 1749

Answers (1)

Alberto Morillo
Alberto Morillo

Reputation: 15628

Scenario 1 should have better performance but taking the following optimizations in consideration:

  1. Create an index on the join columns in the source table that is unique and covering.
  2. Create a unique clustered index on the join columns in the target table.
  3. Parameterize all literal values in the ON clause and in the the WHEN clauses.
  4. Merge subsets of data from the source to the target table by using OFFSET and ROWS FETCH NEXT or by defining views on the source or target that return the filtered rows and reference the view as the source or target table. Furthermore the use of the WITH clause of the TOP clause to filter out rows from the source or target tables is not recommended because they can generate incorrect results.
  5. To further optimize the merge operation try different batch sizes. Here is the reason.

Upvotes: 1

Related Questions