Reputation: 3947
What provides higher performance?
Writing a query using T-SQL, joining tables, then inserting the result into another table
Using Pentaho Spoon's Table input step, then a Database lookup step to "join" one table at a time, then inserting the result into another table
The objective is to take a denormalized table, join it with 5 dimension tables on their text columns to retrieve the dimensions' PKs, and then insert the result into a fact table.
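For concreteness, the T-SQL variant would look roughly like this (all table and column names below are invented for illustration):

```sql
-- Sketch of option 1: one set-based statement that resolves the five
-- dimension PKs by their text and loads the fact table.
INSERT INTO dbo.FactSales (DateKey, ProductKey, CustomerKey, StoreKey, PromoKey, Amount)
SELECT d.DateKey,
       p.ProductKey,
       c.CustomerKey,
       s.StoreKey,
       pr.PromoKey,
       src.Amount
FROM dbo.DenormalizedSales AS src
JOIN dbo.DimDate     AS d  ON d.DateText     = src.DateText
JOIN dbo.DimProduct  AS p  ON p.ProductName  = src.ProductName
JOIN dbo.DimCustomer AS c  ON c.CustomerName = src.CustomerName
JOIN dbo.DimStore    AS s  ON s.StoreName    = src.StoreName
JOIN dbo.DimPromo    AS pr ON pr.PromoName   = src.PromoName;
```

The Spoon variant would be the equivalent Table input → 5 × Database lookup → Table output transformation.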
Upvotes: 1
Views: 1250
Reputation: 6356
It is conventional wisdom to think that SQL outperforms Pentaho PDI on complex queries. That impression comes from the blind belief that the SQL optimizer delivers a true optimum.
I have a number of counterexamples in which we reduced a query time of more than one hour to a few minutes by extracting the complexity of the SQL query into a series of lookups and filters.
We did better because:
The lookup step expects one matching record per entry, while the SQL optimizer must assume the join is not unique. That one-match-per-key situation is exactly what you have when unfolding a star/snowflake schema, as here.
The lookup step is really smart: it reads just the data it needs, keeps it in memory, and maintains internal sorted hashtables to speed up subsequent lookups.
The above is especially efficient when the flow is known to be sorted. And while a "select from oneTable order by" is quick, especially when the table is suitably indexed, the same "select from manyJoinedTables where LotsOfConditions order by" may be pretty inefficient, because SQL cannot count on indexes there.
In fact, I guess the above conditions are exactly the ones the SQL optimizer would like to find and rely on, but cannot, because it has to stay general.
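To make the contrast concrete, here is a sketch with made-up tables and columns; the first sort can ride on an index, the second sorts a join result that no single index covers:

```sql
-- Quick: an index on oneTable(label) can serve both the scan and the ORDER BY.
SELECT id, label
FROM oneTable
ORDER BY label;

-- Potentially slow: the ORDER BY applies to the joined and filtered result,
-- which no single index covers, so the engine must sort it explicitly.
SELECT a.id, b.label
FROM tableA AS a
JOIN tableB AS b ON b.a_id = a.id
WHERE a.status = 'ACTIVE'
  AND b.amount > 100
ORDER BY b.label;
```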
As a rule of thumb, be confident in the efficiency of PDI. Matt Casters and Jens Bleuel made a very good piece of software, tested under volume conditions you can hardly imagine.
So use the solution which is the easier to maintain (most of the time, PDI lookups), and if it is really, really too slow, then move it into Table Input steps, but do not expect it to be systematically better.
Notes:
Avoid the Database Lookup step (its prepared statement uses a cache, but we are precisely in the case where we look up a different key each time).
Avoid the Join steps, i.e. explicitly tell Kettle it can count on a unique match if you know that is the case. The Join Rows and Merge Join steps are efficient, but only when the incoming flows are sorted.
Use Filters (reduce the number of rows) as soon as possible, even in the SQL; every rule has its exception.
Don't bother to reduce the number of columns with Select values. It has almost no impact on speed! You don't think Kettle naively copies the values from step to step instead of using a clever system of pointers, do you?
Computations with a JavaScript step are not as inefficient as the legend says; in fact, PDI is usually much busier with sorting and lookups.
Do not spread the aggregates across many Memory Group by steps. Each of these steps needs to read the whole incoming flow before it knows it is finished, so it blocks the next steps.
Usually the sorted Group by does not improve on the Memory Group by. The one exception is when memory reaches its quota and Java starts launching garbage collection after garbage collection; in that case, a sort can be used to spill the data to temporary disk.
Avoid intermediary tables. Instead, build the flow by adding columns, and when the data is ready, throw it into a Table Output step with a large commit size.
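A rough SQL analogue of the last two notes (made-up names again): filter early and build the whole flow in one statement instead of staging through intermediary tables.

```sql
WITH filtered AS (
    -- filter as early as possible: fewer rows flow through the rest
    SELECT DateText, ProductName, Amount
    FROM dbo.DenormalizedSales
    WHERE Amount > 0
),
enriched AS (
    -- add the dimension keys column by column, much as the PDI flow would
    SELECT d.DateKey, p.ProductKey, f.Amount
    FROM filtered AS f
    JOIN dbo.DimDate    AS d ON d.DateText    = f.DateText
    JOIN dbo.DimProduct AS p ON p.ProductName = f.ProductName
)
-- one final write, no temporary tables in between
INSERT INTO dbo.FactSales (DateKey, ProductKey, Amount)
SELECT DateKey, ProductKey, Amount
FROM enriched;
```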
Upvotes: 0
Reputation: 6023
Probably better suited for dba.stackexchange.com, but I guess a database engine is going to perform this task much faster, because a) it can optimize access to all the tables involved using indexes and table statistics, and b) you get rid of the overhead that an ETL tool and multiple database queries introduce. Pentaho PDI processes rows individually, so for each row coming from your Table input step you will issue an SQL query for every lookup step.
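A sketch of what that means in SQL terms (names are illustrative): a Database lookup step effectively runs a parameterized single-row query for every row of the stream, whereas the engine can resolve the same relation with one set-based join.

```sql
-- What a Database lookup step does, conceptually: executed once per incoming
-- row, with the key bound as a JDBC-style parameter ('?').
SELECT ProductKey
FROM dbo.DimProduct
WHERE ProductName = ?;   -- key taken from the current row of the stream

-- What the database engine can do instead: one set-based join it can optimize
-- with indexes and table statistics.
SELECT src.*, p.ProductKey
FROM dbo.DenormalizedSales AS src
JOIN dbo.DimProduct        AS p ON p.ProductName = src.ProductName;
```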
Upvotes: 1