How to move a subset of a table from one server to another using Data Factory

Question

I think I am facing a common problem when moving data between two servers but I can't find a good solution. I have two servers: server1 and server2. I want to move a subset of a table in server1 to a table in server2. I have the following tables (source and sink tables have the same schema):

server1.dbo.source - Contains ID + other columns.
server2.dbo.sink - Contains ID + other columns (same schema as source table).
server2.dbo.utilizedIds - Contains one column with list of IDs I would like to move.

If I could reference both servers in the same query, the data that I am trying to move from server1 to server2 could be described by this SQL statement:

SELECT * FROM server1.dbo.source 
WHERE server1.dbo.source.id IN (SELECT * FROM server2.dbo.utilizedIds)

What I am trying to achieve is to only move IDs that I need. These IDs are dynamic and stored in a table in my destination server. I am doing this because I need a really small subset of the source table (I need thousands of rows out of billions). So far, these are the solutions that I have found and the reasons why I can't use them or I wouldn't like to use them:

Copy all the data into a temp table in server2 and perform the filtering there - Moving this much data every time I need to update my sink is not feasible.
Perform a delta insert of the source table into server2 and then filter this table - While this would be more performant than option 1, I really don't want to incur the cost of storing billions of rows when I only need thousands.
Create temp tables/persistent tables in server1 and insert the IDs into this table, then filter using standard SQL - I am not the owner of server1, I cannot create tables or stored procedures. The temp table option was promising but temp tables don't persist across activities.
Create a dynamic query and pass the IDs as a string to the SELECT statement. Something like this: SELECT * FROM server1.dbo.source WHERE id IN ('ID1', 'ID2', 'ID3' ...) - So far, this is my best solution. However, SQL statements have a length and parameter limit, and this is not a good practice

To me, this seems like a common use case. What is the best practice or correct solution to this problem? Building my query using strings feels like a hack and not something ADF was designed for.

NiharikaMoola · Accepted Answer

In data flow you can use the Join transformation to get the matching rows of server 1 data (dbo.source) with the IDs of server 2 (dbo.utilizedIds).

Example:

Input:

Server1 – dbo.source

Server2- dbo.utilizedIds

Data flow:

Add a source transformation in data flow and connect it to the source dataset (server1 dbo.source)

Add another source transformation and connect it to server2 Id’s table (server2 dbo.utilizedIds)

Add Join transformation and perform an inner join between source1 (server1 dbo.source) and source2 (server2 dbo.utilizedIds) with join condition as Id.

Join transformation will pull all the columns from the left and right streams for the matched rows. To remove duplicate columns, use select transformation and remove the columns.

Add sink transformation and connect it to the sink table (server2 dbo.sink) to load the source rows matched Ids to the sink.

How to move a subset of a table from one server to another using Data Factory

Answers (1)

Related Questions