Reputation: 997
I have a requirement to parse a lot of small unstructured files in near real time inside Azure and load the parsed data into a SQL database. I chose Python (because I don't think a Spark cluster or big data tooling would suit the volume and size of the source files), and the parsing logic has already been written. I am now looking at the different ways to schedule this Python script using Azure PaaS.
May I ask what the implications are of running a Python notebook activity from Azure Data Factory pointing to Azure Databricks? Would I be able to fully leverage the potential of the cluster (driver and workers)?
Also, please advise whether you think the script has to be converted to PySpark for my use case in order to run on Azure Databricks. My only hesitation is that the files are only a few KB each and they are unstructured.
Upvotes: 1
Views: 812
Reputation: 2483
If the script is pure Python then it will only run on the driver node of the Databricks cluster, making it very expensive (and slow, due to cluster startup times).
You could rewrite it as PySpark, but if the data volumes are as low as you say then this is still expensive and slow: the smallest cluster will consume two VMs, each with 4 cores.
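For context, a conversion might look roughly like the sketch below. This is only an illustration: the `parse_file()` helper is a hypothetical stand-in for your existing parsing logic, and the storage path and SQL connection details are placeholders.

```python
# Rough PySpark sketch (illustrative only): distribute parsing of many small
# files across the workers, then append the results to Azure SQL over JDBC.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def parse_file(path, content):
    # Stand-in for the parsing logic you already have; return one Row per file.
    return Row(source_path=path, content_length=len(content))

# wholeTextFiles yields (path, content) pairs, so the per-file parsing runs
# on the workers instead of only on the driver.
files_rdd = spark.sparkContext.wholeTextFiles(
    "wasbs://container@account.blob.core.windows.net/input/*"  # placeholder
)
parsed_df = spark.createDataFrame(files_rdd.map(lambda kv: parse_file(*kv)))

# Append the parsed rows to Azure SQL Database over JDBC (placeholder values).
(parsed_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("dbtable", "dbo.ParsedFiles")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())
```

Even with this, you are still paying for (and waiting on) a multi-core cluster to process files that are only a few KB each.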
I would look at using Azure Functions instead. Python is now an option: https://learn.microsoft.com/en-us/azure/python/tutorial-vs-code-serverless-python-01
Azure Functions also have great integration with Azure Data Factory, so your workflow would still work.
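As an illustration only, a blob-triggered function could look something like the sketch below. The `parse_file()` helper, the `dbo.ParsedFiles` table and the `SQL_CONN_STR` app setting are all placeholders, and the blob trigger binding itself would be configured in `function.json`.

```python
# Sketch of a blob-triggered Azure Function (Python) that parses one small
# file and loads the result into Azure SQL. Names are illustrative only.
import logging
import os

import azure.functions as func
import pyodbc


def parse_file(raw: bytes) -> dict:
    # Placeholder for the existing parsing logic from the question.
    text = raw.decode("utf-8", errors="replace")
    return {"content_length": len(text)}


def main(myblob: func.InputStream):
    logging.info("Processing blob %s (%s bytes)", myblob.name, myblob.length)
    record = parse_file(myblob.read())

    # Connection string stored as an application setting on the Function App.
    conn = pyodbc.connect(os.environ["SQL_CONN_STR"])
    try:
        cursor = conn.cursor()
        cursor.execute(
            "INSERT INTO dbo.ParsedFiles (SourcePath, ContentLength) VALUES (?, ?)",
            myblob.name,
            record["content_length"],
        )
        conn.commit()
    finally:
        conn.close()
```

You could either let the blob trigger fire as files arrive (near real time), or call the function from your Data Factory pipeline via the Azure Function activity.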
Upvotes: 1