Reputation: 362
I need to merge/concatenate two files that are stored in my Azure Storage Account using an ADF pipeline.
The Copy activity does have a 'Merge files' copy behavior, but that simply appends files with the same structure. My files contain pipe-separated columns.
My requirement is to append the second file's columns to the first file's columns, matched on a particular ID.
Example>>
File1
ID|Name|Age|XX|YY
001|Abc|20|x1|y1
002|Dfg|30|x2|y2
File2
ID|AA|BB
001|a1|b1
002|a2|b2
The output of the ADF activity can be a separate file, or written back into file 1 or file 2, in the format below:
OutputFile
ID|Name|Age|XX|YY|AA|BB
001|Abc|20|x1|y1|a1|b1
002|Dfg|30|x2|y2|a2|b2
Note: the above files were combined based on ID.
How can this be achieved using an ADF pipeline activity?
As mentioned above, I have already tried the 'Merge files' copy behavior; it does not satisfy my requirement.
Related question: azure data factory: how to merge all files of a folder into one file
Upvotes: 0
Views: 1366
Reputation: 362
Though I was not able to find a solution using ADF alone, I came up with a Common Table Expression (CTE) that combines the data in the database, after which I extract the result to a file.
Refer to this for an introduction to CTEs: https://www.geeksforgeeks.org/cte-in-sql/
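For reference, a minimal sketch of the kind of CTE I ended up with, assuming the two files have already been loaded into tables named dbo.File1 and dbo.File2 (the table and column names are placeholders based on the sample files):

WITH Combined AS (
    SELECT f1.ID, f1.Name, f1.Age, f1.XX, f1.YY, f2.AA, f2.BB
    FROM dbo.File1 AS f1
    INNER JOIN dbo.File2 AS f2
        ON f2.ID = f1.ID   -- combine the two files on the shared ID
)
SELECT ID, Name, Age, XX, YY, AA, BB
FROM Combined;

The result set is then extracted to a pipe-delimited file, for example with a Copy activity that uses this query as its source.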
Upvotes: 0
Reputation: 7728
It sounds like the Merge copy behavior functions like a UNION in SQL, so it makes sense that it wouldn't work for your scenario. Given the constraints you've mentioned, specifically the lack of Data Flow, you're going to have to do some custom coding somewhere. I have two ideas on how I would tackle this problem.
1. You mentioned in your comments that the original data files were pulled from SQL Server using Copy. That means (assuming they are from the same server) that you could solve the problem on the source side by writing a Stored Procedure that a) Unpivots the two tables into temp tables and then b) Pivots the results into a staging table (see the sketch after this list). That staging table would then be the Source for your Copy activity. I would prefer this route because the SQL engine would be ideal for this kind of work: when all the data I'm operating on is in SQL Server, I try to put the work there as well. Also, Stored Procedures are really easy to work with in Data Factory.
1a. If you are against the staging table concept, then your Stored Procedure could use Polybase to write the results out to Blob storage instead (a sketch of that also follows the list), in which case there would be no Copy activity in your pipeline.
2. If you must use the text files you previously produced, then an Azure Batch job with custom code to read the blobs, perform the work, and output the target blob is probably your best alternative. This would be my last resort because I find Azure Batch tedious and difficult to work with, but sometimes only custom code will do.
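For option 1, a minimal sketch of what such a procedure might look like. For brevity it fills the staging table with a plain JOIN on ID rather than the unpivot/pivot steps described above, and every object name (dbo.MergeStage, dbo.File1Source, dbo.File2Source) is a placeholder:

CREATE PROCEDURE dbo.usp_BuildMergeStage
AS
BEGIN
    SET NOCOUNT ON;

    -- Rebuild the staging table that the Copy activity will read from
    TRUNCATE TABLE dbo.MergeStage;

    INSERT INTO dbo.MergeStage (ID, Name, Age, XX, YY, AA, BB)
    SELECT f1.ID, f1.Name, f1.Age, f1.XX, f1.YY, f2.AA, f2.BB
    FROM dbo.File1Source AS f1
    INNER JOIN dbo.File2Source AS f2
        ON f2.ID = f1.ID;  -- match rows on the shared ID
END;

A Stored Procedure activity would run this before the Copy activity, and the Copy activity's source dataset would point at dbo.MergeStage with a pipe (|) delimiter configured on the sink.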
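For the Polybase variant in 1a, a rough sketch using CREATE EXTERNAL TABLE AS SELECT (CETAS). This assumes a SQL platform where CETAS is available (for example a Synapse dedicated SQL pool or SQL Server 2022+) and that an external data source and a pipe-delimited external file format have already been created; MyBlobStorage and PipeDelimitedFormat are placeholder names:

CREATE EXTERNAL TABLE dbo.MergedOutput
WITH (
    LOCATION = '/output/merged/',        -- target folder in the blob container
    DATA_SOURCE = MyBlobStorage,         -- pre-created external data source
    FILE_FORMAT = PipeDelimitedFormat    -- pre-created pipe-delimited file format
)
AS
SELECT f1.ID, f1.Name, f1.Age, f1.XX, f1.YY, f2.AA, f2.BB
FROM dbo.File1Source AS f1
INNER JOIN dbo.File2Source AS f2
    ON f2.ID = f1.ID;

With this approach the merged file lands in Blob storage directly, so the pipeline only needs a Stored Procedure activity.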
If you end up doing something else, I'd be very curious to learn how you solve this problem.
Upvotes: 1