Arun Khetarpal

Reputation: 307

Understanding ETL processes

ETL seems to be a pretty common task. I have been reading about mistakes that ETL designers make with very large data at http://it.toolbox.com/blogs/infosphere/17-mistakes-that-etl-designers-make-with-very-large-data-19264

I need some practical insights on the following points:

a) Incorporating inserts, updates, and deletes into the same data flow / same process. How is that a problem?

b) Sourcing multiple systems at the same time, depending on heterogeneous systems of data.

c) Not producing the correct indexes on the sources/lookups that need to be accessed.

d) Believing that 'I need to process all the data in one pass because it's the fastest way to do it'.

Any help?

Upvotes: 3

Views: 1552

Answers (2)

Nick.Mc

Reputation: 19184

A) It becomes a problem when the task starts taking too long to complete (because data volumes have grown) and the operations are by then too entangled to split apart. On the other hand, splitting the tasks out increases the chance of inconsistent data loads (e.g. your DELETE works but your INSERT fails, leaving you missing a load of data).

B) I don't understand 'at the same time' here. Do you mean simultaneously? You could max out bandwidth (network, disk, etc.) if you try to load data from many systems simultaneously. Sometimes you don't have a choice, for example when the data has to be loaded during offline hours.

C) Yes, incorrect indexes will slow down access. But vendors often don't like you creating indexes in the source database.

D) Performance tuning (the fastest way to do it) is a complex topic. In some cases it might be faster to do it in one pass. In other cases it may not.
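To make the risk in A) concrete, here is a minimal sketch of keeping a DELETE and its matching INSERT in one transaction, so a failed insert also undoes the delete. It uses Python's built-in sqlite3 module; the `sales` table, its columns, and the `reload_partition` helper are made up for illustration and are not from the article or the question.

```python
import sqlite3

def reload_partition(conn, load_date, amounts):
    """Replace all rows for load_date; the DELETE and INSERT succeed or fail together."""
    # Using the connection as a context manager commits on success
    # and rolls back if any statement inside raises.
    with conn:
        conn.execute("DELETE FROM sales WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO sales (load_date, amount) VALUES (?, ?)",
            [(load_date, amount) for amount in amounts],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (load_date TEXT NOT NULL, amount REAL NOT NULL)")
reload_partition(conn, "2024-01-01", [10.0, 20.0])

# A bad batch (None violates NOT NULL) fails during the INSERT,
# so the DELETE that ran just before it is rolled back as well.
try:
    reload_partition(conn, "2024-01-01", [5.0, None])
except sqlite3.IntegrityError:
    pass

remaining = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(remaining)  # the original two rows are still there
```

If the delete and the insert ran as separate, independently committed steps, the failed second load would have left the table empty for that date.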

Upvotes: 1

user2943601

Reputation: 31

a) Data integrity issue

b) Data quality will increase, and smaller chunks mean fewer failures.

c) It will take more time to complete.

d) Wrong indexes can cost more time. It is better to create indexes based on the query you are executing, i.e. on the columns that appear in the WHERE clause of the statement.

e) Splitting the data into smaller data sets and processing those would be an efficient solution.
You're a BITS-PILANI (WILP) student, right?
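As a rough illustration of point e), the sketch below copies rows in key-ordered batches and commits each batch separately, so a failure only costs the current chunk. It uses Python's built-in sqlite3 module; the `staging` and `target` tables and the `copy_in_chunks` helper are hypothetical names, and the right chunk size depends on your system.

```python
import sqlite3

def copy_in_chunks(conn, chunk_size=1000):
    """Copy staging rows into target in batches, one transaction per batch."""
    last_id = 0
    copied = 0
    while True:
        # Read the next batch by key range; the primary key index
        # serves both the WHERE filter and the ORDER BY.
        batch = conn.execute(
            "SELECT id, value FROM staging WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not batch:
            return copied
        with conn:  # commit this chunk, or roll it back on error
            conn.executemany("INSERT INTO target (id, value) VALUES (?, ?)", batch)
        last_id = batch[-1][0]  # resume after the last key processed
        copied += len(batch)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
with conn:
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?)",
        [(i, f"row{i}") for i in range(1, 2501)],
    )

total = copy_in_chunks(conn, chunk_size=1000)
print(total)
```

Because each chunk resumes from the last key it saw, a rerun after a failure could start from the highest id already in `target` instead of reprocessing everything.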

Upvotes: 3
