mussi89

Reputation: 99

Do you have to use Azure Data Factory, or can you just use Databricks as your ETL tool from your multiple sources?

...Or do I need to add the data into a data lake using Data Factory first and then use Databricks for ELT?

Upvotes: 0

Views: 702

Answers (2)

DennisZ

Reputation: 113

It depends on the scenario, I think. If you have a wide variety of data sources you need to connect to, then ADF is probably the better option.

If your sources are data files (in any format), you could consider using Databricks for ETL.

I use Databricks as a pure ETL tool (without ADF) by mounting a storage container in Blob Storage from a notebook, reading huge XML data from there, and loading it into a DataFrame in Databricks. I then parse and reshape the DataFrame and write the data into an Azure SQL Database. Fair to say I'm not really using it for the "E" in ETL, as the data has already been extracted from the real source system.

The big advantage is the power you have at your disposal to parse the files.
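A minimal sketch of that flow, assuming the spark-xml library is installed on the cluster and using placeholder names for the storage account, container, secret scope, rowTag and JDBC details:

# Mount the Blob Storage container once (hypothetical account/container/secrets).
dbutils.fs.mount(
    source="wasbs://raw-data@mystorageacct.blob.core.windows.net",
    mount_point="/mnt/raw-data",
    extra_configs={
        "fs.azure.account.key.mystorageacct.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")
    },
)

# Read the XML files into a DataFrame (rowTag depends on the XML structure).
df = (
    spark.read.format("xml")
    .option("rowTag", "record")
    .load("/mnt/raw-data/*.xml")
)

# Reshape as needed, e.g. pick out and rename a few columns (placeholder schema).
clean = df.selectExpr("id", "payload.value as value")

# Write the result to an Azure SQL Database table over JDBC.
(
    clean.write.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.staging_table")
    .option("user", dbutils.secrets.get(scope="my-scope", key="sql-user"))
    .option("password", dbutils.secrets.get(scope="my-scope", key="sql-password"))
    .mode("append")
    .save()
)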

Best regards.

Upvotes: 0

databash

Reputation: 646

Depends.

Databricks can connect to data sources and ingest data. However, Azure Data Factory (ADF) has more connectors than Databricks, so it depends on what you need. If you use ADF, you need to land the data somewhere (e.g. Azure Storage) so that Databricks can pick it up.
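For example, once ADF has landed files in a storage account, Databricks can read them directly; a rough sketch, with placeholder storage account, container, secret scope and path names:

# Authenticate to the storage account with an account key kept in a secret scope.
spark.conf.set(
    "fs.azure.account.key.mystorageacct.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-key"),
)

# Read the files ADF copied into the landing container.
landed = (
    spark.read.format("parquet")
    .load("abfss://landing@mystorageacct.dfs.core.windows.net/sales/2020/")
)

# Continue transforming in Databricks, e.g. expose the data to SQL.
landed.createOrReplaceTempView("sales_landed")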

Moreover, another main feature of ADF is orchestrating data movement and activities. Databricks does have a Jobs feature to schedule notebooks or JARs, but it is limited to Databricks itself. If you want to orchestrate anything outside of Databricks (e.g. drop a file to SFTP, send an email on completion, terminate the Databricks cluster, etc.), then ADF is the way to go.

Upvotes: 1
