Akash

Reputation: 359

How to make a generic pipeline for data transformation using Azure Databricks and Data Factory

I have a requirement to build a GUI that collects some user input and also lets the user import a CSV file. Once the file is imported, I want to transform the data with Azure Databricks (PySpark) and store the transformed output somewhere the user can download it from. I would like to know how to make this a generic pipeline, so that anyone in the organization can upload a file (with arbitrary columns and data types) and Databricks performs the transformation and stores the result. For all of these activities I want to leverage the Azure platform.

Upvotes: 0

Views: 435

Answers (1)

Murray Foxcroft

Reputation: 13745

Your question is quite vague, but here are some pointers.

Build your UI to upload the file to a folder in ADLS Gen2 blob storage. Example here. Your ASP.NET application can then kick off a Databricks notebook via the Jobs API to run the transformations; see the sketch below. Alternatively, you can use Azure Event Grid to detect the new file and trigger the processing. If you need features from ADF (Azure Data Factory) in addition to Databricks, the upload can also kick off an ADF pipeline, and ADF can call Databricks through its Databricks activity.
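As a minimal sketch of the upload-then-trigger flow from the application side: the storage connection string, workspace URL, token and job ID below are placeholders, and the job is assumed to already exist in the Databricks workspace with a notebook that reads an `input_path` parameter.

    import requests
    from azure.storage.blob import BlobServiceClient

    STORAGE_CONN_STR = "<adls-gen2-connection-string>"            # placeholder
    DATABRICKS_HOST = "https://<workspace>.azuredatabricks.net"   # placeholder
    DATABRICKS_TOKEN = "<personal-access-token>"                  # placeholder
    JOB_ID = 123                                                  # hypothetical job configured in the workspace

    def upload_csv(local_path: str, container: str, blob_name: str) -> str:
        """Upload the user's CSV into a container on the ADLS Gen2 account."""
        service = BlobServiceClient.from_connection_string(STORAGE_CONN_STR)
        blob = service.get_blob_client(container=container, blob=blob_name)
        with open(local_path, "rb") as f:
            blob.upload_blob(f, overwrite=True)
        return f"abfss://{container}@<account>.dfs.core.windows.net/{blob_name}"

    def run_transformation(input_path: str) -> int:
        """Kick off the Databricks job via the Jobs API (run-now), passing the file path as a notebook parameter."""
        resp = requests.post(
            f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
            headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
            json={"job_id": JOB_ID, "notebook_params": {"input_path": input_path}},
        )
        resp.raise_for_status()
        # Poll /api/2.1/jobs/runs/get with this run_id if you want to track progress.
        return resp.json()["run_id"]

    if __name__ == "__main__":
        path = upload_csv("user_upload.csv", "uploads", "incoming/user_upload.csv")
        run_id = run_transformation(path)
        print(f"Started Databricks run {run_id} for {path}")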

Since all of the above are asynchronous to your web application, you will need to notify the user when the transformed file becomes available. You can have your UI detect the new file based on a convention and/or metadata, or call SendGrid at the end of the Databricks job (or via Event Grid) to send a notification email.
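For the SendGrid option, the final cell of the notebook might look roughly like this; the secret scope, widget names and sender address are illustrative assumptions, not fixed names.

    # Final notebook cell: after writing the transformed data, email the user.
    from sendgrid import SendGridAPIClient
    from sendgrid.helpers.mail import Mail

    sendgrid_key = dbutils.secrets.get(scope="pipeline", key="sendgrid-api-key")  # hypothetical secret
    user_email = dbutils.widgets.get("user_email")     # passed in via notebook_params
    output_path = dbutils.widgets.get("output_path")   # where the transformed file was written

    message = Mail(
        from_email="pipeline@yourorg.com",             # illustrative sender
        to_emails=user_email,
        subject="Your transformed file is ready",
        html_content=f"<p>Your file has been processed and is available at <code>{output_path}</code>.</p>",
    )
    SendGridAPIClient(sendgrid_key).send(message)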

So, there are a few options. Keep it simple :)

Upvotes: 1
