chicagobeast12

Reputation: 695

Azure/Databricks - Best way to ingest data?

I'm new to Azure and Databricks. I've been watching training videos and have cloud experience with AWS, but I'm on a time crunch, so help would be appreciated. I have multiple data sources from which I need to ingest live data (via API calls/database connections) into Azure and run transformations/ML in Databricks. I will likely need to output the cleaned dataframe(s) into a data warehouse or SQL database that will have a BI connection. If someone with experience in Azure Databricks can recommend which products I need, that would be terrific. Note this is not 'big data' (only 100,000 rows max), but it will need enough compute capacity to run ML (NLP) quickly.

1. ELT/ETL - Should I go Data Factory -> Databricks, or maybe Kafka -> Blob Storage -> Databricks?
2. What worker type/size is recommended for live data processing / an NLP application?

Upvotes: 0

Views: 813

Answers (2)

In modern data handling, an ELT approach is generally recommended over ETL. Data should be landed in the data lake before it is transformed for the business use case. This is the recommended pattern for any tooling built on distributed systems.
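A minimal PySpark sketch of that load-then-transform pattern, assuming the ingestion job has already landed raw files in a "raw" container; the storage account, container names, and columns are placeholders, and the `spark` session is the one provided by a Databricks notebook:

```python
from pyspark.sql import functions as F

# Hypothetical landing and curated zones in ADLS Gen2
raw_path = "abfss://raw@mystorageacct.dfs.core.windows.net/orders/"
curated_path = "abfss://curated@mystorageacct.dfs.core.windows.net/orders/"

# Load: read the data exactly as it was landed by the ingestion job
raw_df = spark.read.json(raw_path)

# Transform: apply business logic only after the raw copy is safely stored
clean_df = (
    raw_df
    .dropDuplicates(["order_id"])
    .withColumn("ingested_at", F.current_timestamp())
)

# Write the curated result as a Delta table for downstream BI / ML
clean_df.write.format("delta").mode("overwrite").save(curated_path)
```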

Upvotes: 0

Hubert Dudek

Reputation: 1722

As it is the beginning of the project, the simplest way is just to write notebooks in Databricks: connect to the source, load the data into DBFS storage, and then process that data again in Databricks (ML etc.).
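As a hedged sketch of what such a notebook could look like, assuming a SQL Server source reachable over JDBC and a Databricks secret scope named `my-scope`; the server, table, and output path are placeholders:

```python
# Pull a table from the source database over JDBC inside a Databricks notebook
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=sales"

source_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.customers")
    .option("user", dbutils.secrets.get(scope="my-scope", key="db-user"))
    .option("password", dbutils.secrets.get(scope="my-scope", key="db-password"))
    .load()
)

# Stage the raw extract on DBFS so later transformation / ML steps can re-read it
source_df.write.format("delta").mode("overwrite").save("dbfs:/raw/customers")
```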

If it is a small dataset, just take the simplest setup: 1 worker + driver.
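As a rough illustration, a single-worker cluster spec of the kind you would pass as `new_cluster` to the Databricks Jobs API could look like the sketch below; the runtime version and node type are assumptions, so pick whatever is current for your workspace:

```python
# Hypothetical minimal cluster spec: one worker plus the driver
small_cluster = {
    "spark_version": "10.4.x-scala2.12",   # assumed Databricks runtime version
    "node_type_id": "Standard_DS3_v2",     # assumed small general-purpose Azure VM
    "num_workers": 1,
}
```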

In the future you can always upgrade the worker type and set the notebooks to run as jobs via Data Factory.

Upvotes: 1
