Reputation: 1779
After generating ~90 different 100 MB gzipped CSV files, I want to merge them all into a single file. Using the built-in merge option of a Copy activity, it looks like the operation would take well over a dozen hours.
https://i.sstatic.net/yymnW.png
How can I merge many files in blob/ADLS storage quickly with Data Factory/Synapse?
Upvotes: 0
Views: 1762
Reputation: 20302
Easy, just convert the objects to pandas DataFrames and then concatenate them.
Step #1:
# Convert the Spark DataFrames to pandas
import pandas as pd

df1 = df1.select("*").toPandas()
df2 = df2.select("*").toPandas()
Step #2:
# Stack the rows (axis=0) so the files are appended rather than placed side by side
result = pd.concat([df1, df2], axis=0, ignore_index=True)
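For the ~90 files in the question the same idea generalizes to a loop. A minimal sketch, assuming the gzipped CSVs share one schema, fit in memory, and sit under a placeholder path (adjust the paths to your storage mount):
import pandas as pd

# Placeholder paths for the ~90 gzipped CSV files; adjust to your layout.
paths = [f"/mnt/data/part_{i:02d}.csv.gz" for i in range(90)]

# pandas infers gzip compression from the .gz extension.
frames = [pd.read_csv(p) for p in paths]

# Append the rows of every file and write one output file.
merged = pd.concat(frames, axis=0, ignore_index=True)
merged.to_csv("/mnt/data/merged.csv", index=False)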
See this link for more info.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
Also, here are a couple of other techniques to consider.
https://www.sqlservercentral.com/articles/merge-multiple-files-in-azure-data-factory
https://markcarrington.dev/2020/11/27/combining-data-with-azure-data-factory/
Upvotes: -1
Reputation: 2056
You could try a 2-step process: first convert the gzipped CSVs to Parquet, then merge the Parquet output into the final file.
Writes into Parquet are generally quick (provided you have clean data, e.g. no spaces in column names) and the files are smaller in size.
Edit - ADF Data Flow is another option. If that is still not fast enough, you might have to create a Spark notebook in Synapse and write Spark code. Choose a Spark pool size that balances time and cost.
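For reference, a minimal PySpark sketch of that notebook approach, assuming the files sit in an ADLS Gen2 container the Spark pool can access (the account, container and folder names below are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder ADLS Gen2 paths; replace with your storage account and container.
source = "abfss://container@account.dfs.core.windows.net/input/*.csv.gz"
target = "abfss://container@account.dfs.core.windows.net/merged/"

# Spark decompresses gzip CSVs transparently; set header/schema options as needed.
df = spark.read.option("header", "true").csv(source)

# coalesce(1) forces a single output file. The ~9 GB of compressed input is still
# workable, but the final write runs through one task, so it is the slow part.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(target)
Dropping coalesce(1) and letting a Copy activity merge the handful of output parts afterwards is the usual trade-off if that single-file write becomes the bottleneck.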
Upvotes: 1