chicagobeast12

Reputation: 695

Azure/Databricks - Best way to ingest data?

I'm new to Azure and Databricks. I've been watching training videos and have cloud experience with AWS, but I'm on a time crunch, so help would be appreciated. I have multiple data sources from which I need to ingest live data (via API calls/database connections) into Azure and run transformations/ML in Databricks. I will likely need to output the cleaned dataframe(s) into a data warehouse or SQL database that will have a BI connection. If someone with experience in Azure Databricks can recommend which products I need, that would be terrific. Note this is not 'big data' (only 100,000 rows max), but it will need enough compute capacity to run ML (NLP) quickly.

1. ELT/ETL - Should I go Data Factory -> Databricks, or maybe Kafka -> Blob Storage -> Databricks?
2. What worker type/size is recommended for live data processing / an NLP application?

Upvotes: 0

Views: 813

Answers (2)

In modern data handling, an ELT approach is generally recommended over ETL. Data should be landed in the data lake before it is transformed for the business use case. This is the recommended pattern for any tooling built on distributed systems.
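A minimal PySpark sketch of that load-then-transform pattern, assuming the ingestion job has already landed raw files in a "raw" container; the storage account, container names, and columns are placeholders, and the `spark` session is the one provided by a Databricks notebook:

```python
from pyspark.sql import functions as F

# Hypothetical landing and curated zones in ADLS Gen2
raw_path = "abfss://raw@mystorageacct.dfs.core.windows.net/orders/"
curated_path = "abfss://curated@mystorageacct.dfs.core.windows.net/orders/"

# Load: read the data exactly as it was landed by the ingestion job
raw_df = spark.read.json(raw_path)

# Transform: apply business logic only after the raw copy is safely stored
clean_df = (
    raw_df
    .dropDuplicates(["order_id"])
    .withColumn("ingested_at", F.current_timestamp())
)

# Write the curated result as a Delta table for downstream BI / ML
clean_df.write.format("delta").mode("overwrite").save(curated_path)
```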

Upvotes: 0

Hubert Dudek

Reputation: 1722

As it is the beginning of the project, the simplest way is just to write notebooks in Databricks: connect to the source, load the data into DBFS storage, and then process that data again in Databricks (ML etc.).
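As a hedged sketch of what such a notebook could look like, assuming a SQL Server source reachable over JDBC and a Databricks secret scope named `my-scope`; the server, table, and output path are placeholders:

```python
# Pull a table from the source database over JDBC inside a Databricks notebook
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=sales"

source_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.customers")
    .option("user", dbutils.secrets.get(scope="my-scope", key="db-user"))
    .option("password", dbutils.secrets.get(scope="my-scope", key="db-password"))
    .load()
)

# Stage the raw extract on DBFS so later transformation / ML steps can re-read it
source_df.write.format("delta").mode("overwrite").save("dbfs:/raw/customers")
```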

If it is a small dataset, just take the simplest setup: 1 worker + driver.
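As a rough illustration, a single-worker cluster spec of the kind you would pass as `new_cluster` to the Databricks Jobs API could look like the sketch below; the runtime version and node type are assumptions, so pick whatever is current for your workspace:

```python
# Hypothetical minimal cluster spec: one worker plus the driver
small_cluster = {
    "spark_version": "10.4.x-scala2.12",   # assumed Databricks runtime version
    "node_type_id": "Standard_DS3_v2",     # assumed small general-purpose Azure VM
    "num_workers": 1,
}
```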

In the future you can always upgrade the worker type and set the notebooks to run as jobs via Data Factory.

Upvotes: 1
