Understanding ETL processes

Question

ETL seems to be a pretty common task. I have been reading about mistakes that ETL designers make with very large data at http://it.toolbox.com/blogs/infosphere/17-mistakes-that-etl-designers-make-with-very-large-data-19264

I need some practical insights on the following points:

a) Incorporating inserts, updates, and deletes into the same data flow / same process. How is that a problem?

b) Sourcing multiple systems at the same time, depending on heterogeneous systems of data.

c) Not producing the correct indexes on the sources/lookups that need to be accessed.

d) Believing that 'I need to process all the data in one pass because it's the fastest way to do it'.

Any help?

Nick.Mc · Accepted Answer

A) It's a problem if the task ends up taking too long to complete (because data volumes have grown) and it is then technically too difficult to split the operations apart. On the other hand, splitting them out increases the risk of inconsistent data loads (e.g. your DELETE succeeds but your INSERT fails, so you end up missing a load of data).
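One way to keep a split DELETE and INSERT consistent is to run both inside a single transaction, so a failed insert also undoes the delete. This is only a minimal sketch using Python's built-in `sqlite3`; the `sales` table, its columns, and the `reload_partition` helper are made up for illustration, and your ETL tool or database may handle transactions differently.

```python
import sqlite3


def reload_partition(conn, load_date, amounts):
    """Replace all rows for load_date; the delete and insert succeed or fail together."""
    # "with conn" commits if the block finishes and rolls back if it raises.
    with conn:
        conn.execute("DELETE FROM sales WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO sales (load_date, amount) VALUES (?, ?)",
            [(load_date, amount) for amount in amounts],
        )


# Hypothetical target table, in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (load_date TEXT NOT NULL, amount REAL NOT NULL)")

reload_partition(conn, "2024-01-01", [10.0, 20.0])

try:
    # None violates NOT NULL, so the insert fails after the delete has already run.
    reload_partition(conn, "2024-01-01", [5.0, None])
except sqlite3.IntegrityError:
    pass

# The rollback restored the rows that the failed reload had deleted.
remaining = conn.execute("SELECT amount FROM sales ORDER BY amount").fetchall()
```

The trade-off is that one big transaction can hold locks for longer, which is part of why combined flows get slow as volumes grow.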

B) I don't understand 'at the same time' here. Do you mean simultaneously? You could max out bandwidth (network, disk, etc.) if you try to load data from many systems simultaneously. Sometimes you don't have a choice, if the data can only be loaded during offline times.

C) Yes, incorrect indexes will slow down access. But vendors often don't like you creating indexes in the source database.
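When you can't index the vendor's database, a common workaround is to copy the lookup columns into a staging table you own and index that instead. The sketch below uses `sqlite3` just to show the effect on the query plan; the `stg_customer` table, its columns, and the index name are hypothetical, and the plan text will look different on other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging copy of a lookup table pulled from the source system.
conn.execute("CREATE TABLE stg_customer (customer_code TEXT, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO stg_customer VALUES (?, ?)",
    [(f"C{i:05d}", i) for i in range(1000)],
)


def lookup_plan(conn):
    """Return SQLite's query plan text for a single-key lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT customer_id FROM stg_customer WHERE customer_code = ?",
        ("C00042",),
    ).fetchall()
    return " ".join(row[3] for row in rows)  # column 3 holds the plan detail


plan_before = lookup_plan(conn)  # no index yet: a full table scan

conn.execute("CREATE INDEX ix_stg_customer_code ON stg_customer (customer_code)")

plan_after = lookup_plan(conn)  # the plan now names the index
```

Comparing `plan_before` and `plan_after` is a cheap way to confirm the lookup actually uses the index before you run it against millions of rows.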

D) Performance tuning ('the fastest way to do it') is a complex topic. In some cases it might be faster to do it in one pass. In other cases it may not be.
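The practical answer is to measure rather than assume. Here is a small sketch that loads the same rows once in batches and once in a single pass and records the timings; the table names, the 10,000-row data set, and the batch sizes are invented for the example, and results on an in-memory SQLite database won't predict what your real systems do.

```python
import sqlite3
import time


def make_db(table):
    """Create an in-memory database with one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, amount REAL)")
    return conn


def load_in_batches(src, dst, batch_size):
    """Copy rows from src to dst, committing once per batch."""
    cur = src.execute("SELECT id, amount FROM source_rows ORDER BY id")
    total = 0
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        with dst:  # one transaction per batch
            dst.executemany(
                "INSERT INTO target_rows (id, amount) VALUES (?, ?)", batch
            )
        total += len(batch)
    return total


ROW_COUNT = 10_000
src = make_db("source_rows")
with src:
    src.executemany(
        "INSERT INTO source_rows VALUES (?, ?)",
        [(i, i * 1.5) for i in range(ROW_COUNT)],
    )

timings = {}
counts = {}
for batch_size in (500, ROW_COUNT):  # ROW_COUNT means everything in one pass
    dst = make_db("target_rows")
    start = time.perf_counter()
    load_in_batches(src, dst, batch_size)
    timings[batch_size] = time.perf_counter() - start
    counts[batch_size] = dst.execute("SELECT COUNT(*) FROM target_rows").fetchone()[0]
    dst.close()
```

Batching also bounds memory use and lets a failed run restart from the last committed batch, which can matter more than raw speed.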
