Reputation: 4675
Our team has deployed a Python script on Azure Machine Learning (AML) to process files stored on an Azure storage account.
Our pipeline consists of a ForEach activity, that calls the Python script for each or the listed files. Running it from the Azure Data Factory (ADF) triggers multiple individual pipelines that run concurrently. I am not using the expression in parallel because I am unsure how these individual jobs are assigned across the different vCPUs.
In addition to the "parallelization" managed by AML, would it make sense to parallelize the processing at the Python level using Python's multiprocessing module?
I gave it a shot using the following approach, but it did not appear to reduce the overall processing time.
from multiprocessing import Process
from multiprocessing import Pool
[...]
mp_processes = 2
if mp_processes is True :
p = Pool(int(mp_processes))
output1, output2 = zip(*p.map(process, process_queue))
[...]
Upvotes: 0
Views: 62