Ayush Sood

Reputation: 111

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses a variety of sources and sends the data to BigQuery. Talend cannot handle both relational and non-relational database components in one job for my use case, so here's how I am doing it currently:

JOB 1 -- Get data from a source (SQL Server, API, etc.), transform it, and store the transformed data in a delimited file (text or CSV).
JOB 2 -- Use the transformed data stored by JOB 1 in the delimited file as the source, transform it again to fit BigQuery, and send it.
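Roughly, JOB 1's output step looks like this (a simplified Python sketch for illustration, not my actual Talend job; the column names, rows, and file name are placeholders):

    import csv

    # Simplified sketch of JOB 1's final step: write transformed rows to a
    # delimited file that JOB 2 later reads as its source. Rows, columns,
    # and the file name are placeholders for illustration.
    def write_intermediary(rows, path="transformed_data.csv"):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="|")
            writer.writerow(["id", "name", "amount"])  # header row
            writer.writerows(rows)

    write_intermediary([(1, "alpha", 10.5), (2, "beta", 20.0)])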

I am using a delimited text/CSV file as the intermediary data storage to achieve this. Since confidentiality of the data is important and the solution also needs to scale to millions of rows, what should I use as this intermediary source? Will a relational database help? Or are delimited files good enough? Or is there anything else I can use?

PS - I am deleting these files as soon as the job finishes, but I am worried about security while the job runs, although it will run on a secure cloud architecture. Please share your views on this.

Upvotes: 0

Views: 334

Answers (2)

yan-hic

Reputation: 1554

I would do ELT instead of ETL: load the source data as-is and transform it in BigQuery using SQL functions.

This potentially allows you to reshape the data (e.g. convert it to arrays), filter out columns/rows, and perform the transform in a single SQL statement.
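For example, once the raw file is loaded into a staging table, a single query can do all of the above (a rough sketch using the google-cloud-bigquery client; the dataset, table, and column names are made up):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Transform the raw, as-loaded data entirely inside BigQuery.
    # Dataset, table, and column names below are illustrative only.
    sql = """
    CREATE OR REPLACE TABLE mydataset.orders_clean AS
    SELECT
      CAST(order_id AS INT64) AS order_id,
      ARRAY_AGG(item) AS items,            -- reshape rows into an array
      SUM(CAST(amount AS NUMERIC)) AS total_amount
    FROM mydataset.orders_raw
    WHERE amount IS NOT NULL               -- filter bad rows during transform
    GROUP BY order_id
    """
    client.query(sql).result()  # wait for the transform to finish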

Upvotes: 0

Jose Bagatelli

Reputation: 1409

In data warehousing architecture, it's usually good practice to make the staging layer persistent. Among other things, this gives you the ability to trace data lineage back to the source, enables you to reload your final model from the staging point when business rules change, and gives you a full picture of the transformation steps the data went through, all the way from landing to reporting.

I'd also consider changing your design to keep the staging layer persistent under its own dataset in BigQuery, rather than just deleting the files after processing.

Since this is just an operational layer for ETL/ELT and not end-user reporting, you will mostly be paying only for storage.

Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need, and you have full control over permissions. BigQuery works seamlessly with Cloud Storage, and you can even load a table from a Storage file straight from the Cloud Console.
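For instance, the load side of your pipeline could upload the intermediary file to a bucket and load it into BigQuery directly from there (a sketch using the google-cloud-storage and google-cloud-bigquery clients; the bucket, project, dataset, and file names are placeholders):

    from google.cloud import bigquery, storage

    # Stage the intermediary file in Cloud Storage, then load it into
    # BigQuery straight from the bucket. Names below are placeholders.
    storage.Client().bucket("my-staging-bucket") \
        .blob("etl/transformed_data.csv") \
        .upload_from_filename("transformed_data.csv")

    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row
        autodetect=True,       # infer the schema from the file
    )
    load_job = bq.load_table_from_uri(
        "gs://my-staging-bucket/etl/transformed_data.csv",
        "my-project.my_dataset.transformed_data",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to complete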

All things considered, whichever direction you choose, I recommend storing the files you use to load the table rather than deleting them. Sooner or later there will be questions or failures in your final report, and you'll likely need to trace back to the source to investigate.

In a nutshell, the process would be:

|---Extract and Transform---|----Load----|
  Source  ---> Cloud Storage --> BigQuery

Upvotes: 1
