Reputation: 3947
What provides higher performance?
Writing a query using T-SQL, joining tables, then inserting the result into another table
Using Pentaho Spoon's Table input step, then a Database lookup step to "join" one table at a time, then inserting the result into another table
The objective is to take a denormalized table, join it with 5 dimension tables on their text columns to retrieve the dimensions' PKs, and then insert the result into a fact table.
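For concreteness, the T-SQL variant would look roughly like this (all table and column names below are invented for illustration):

```sql
-- Sketch of option 1: one set-based statement that resolves the five
-- dimension PKs by their text and loads the fact table.
INSERT INTO dbo.FactSales (DateKey, ProductKey, CustomerKey, StoreKey, PromoKey, Amount)
SELECT d.DateKey,
       p.ProductKey,
       c.CustomerKey,
       s.StoreKey,
       pr.PromoKey,
       src.Amount
FROM dbo.DenormalizedSales AS src
JOIN dbo.DimDate     AS d  ON d.DateText     = src.DateText
JOIN dbo.DimProduct  AS p  ON p.ProductName  = src.ProductName
JOIN dbo.DimCustomer AS c  ON c.CustomerName = src.CustomerName
JOIN dbo.DimStore    AS s  ON s.StoreName    = src.StoreName
JOIN dbo.DimPromo    AS pr ON pr.PromoName   = src.PromoName;
```

The Spoon variant would be the equivalent Table input → 5 × Database lookup → Table output transformation.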
Upvotes: 1
Views: 1250
Reputation: 6356
It is conventional wisdom to think that SQL outperforms Pentaho PDI on complex queries. That impression comes from the blind belief that the SQL optimizer delivers a true optimum.
I have a number of counterexamples in which we reduced a query time of more than one hour to a few minutes by extracting the complexity of the SQL query into a series of lookups and filters.
We did better because:
The lookup step expects one matching record per entry, while the SQL optimizer must assume the join is not unique. That one-match-per-key situation is exactly what you have when unfolding a star/snowflake schema, as here.
The lookup step is really smart: it reads just the data it needs, keeps it in memory, and maintains internal sorted hashtables to speed up subsequent lookups.
The above is especially efficient when the flow is known to be sorted. And while a "select from oneTable order by" is quick, especially when the table is suitably indexed, the same "select from manyJoinedTables where LotsOfConditions order by" may be pretty inefficient, because SQL cannot count on indexes there.
In fact, I guess the above conditions are exactly the ones the SQL optimizer would like to find and rely on, but cannot, because it has to stay general.
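To make the contrast concrete, here is a sketch with made-up tables and columns; the first sort can ride on an index, the second sorts a join result that no single index covers:

```sql
-- Quick: an index on oneTable(label) can serve both the scan and the ORDER BY.
SELECT id, label
FROM oneTable
ORDER BY label;

-- Potentially slow: the ORDER BY applies to the joined and filtered result,
-- which no single index covers, so the engine must sort it explicitly.
SELECT a.id, b.label
FROM tableA AS a
JOIN tableB AS b ON b.a_id = a.id
WHERE a.status = 'ACTIVE'
  AND b.amount > 100
ORDER BY b.label;
```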
As a rule of thumb, be confident in the efficiency of PDI. Matt Casters and Jens Bleuel made a very good piece of software, tested under volume conditions you can hardly imagine.
So use the solution which is the easier to maintain (most of the time, PDI lookups), and if it is really, really too slow, then move it into Table Input steps, but do not expect it to be systematically better.
Notes:
Avoid the Database Lookup step (its prepared statement uses a cache, but we are precisely in the case where we look up a different key each time).
Avoid the Join steps, i.e. explicitly tell Kettle it can count on a unique match if you know that is the case. The Join Rows and Merge Join steps are efficient, but only when the incoming flows are sorted.
Use Filters (reduce the number of rows) as soon as possible, even in the SQL; every rule has its exception.
Don't bother to reduce the number of columns with Select values. It has almost no impact on speed! You don't think Kettle naively copies the values from step to step instead of using a clever system of pointers, do you?
Computations with a JavaScript step are not as inefficient as the legend says; in fact, PDI is usually much busier with sorting and lookups.
Do not spread the aggregates across many Memory Group by steps. Each of these steps needs to read the whole incoming flow before it knows it is finished, so it blocks the next steps.
Usually the sorted Group by does not improve on the Memory Group by. The one exception is when memory reaches its quota and Java starts launching garbage collection after garbage collection; in that case, a sort can be used to spill the data to temporary disk.
Avoid intermediary tables. Instead, build the flow by adding columns, and when the data is ready, throw it into a Table Output step with a large commit size.
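A rough SQL analogue of the last two notes (made-up names again): filter early and build the whole flow in one statement instead of staging through intermediary tables.

```sql
WITH filtered AS (
    -- filter as early as possible: fewer rows flow through the rest
    SELECT DateText, ProductName, Amount
    FROM dbo.DenormalizedSales
    WHERE Amount > 0
),
enriched AS (
    -- add the dimension keys column by column, much as the PDI flow would
    SELECT d.DateKey, p.ProductKey, f.Amount
    FROM filtered AS f
    JOIN dbo.DimDate    AS d ON d.DateText    = f.DateText
    JOIN dbo.DimProduct AS p ON p.ProductName = f.ProductName
)
-- one final write, no temporary tables in between
INSERT INTO dbo.FactSales (DateKey, ProductKey, Amount)
SELECT DateKey, ProductKey, Amount
FROM enriched;
```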
Upvotes: 0
Reputation: 6023
Probably better suited for dba.stackexchange.com, but I guess a database engine is going to perform this task much faster, because a) it can optimize access to all the tables involved using indexes and table statistics, and b) you get rid of the overhead that an ETL tool and multiple database queries introduce. Pentaho PDI processes rows individually, so for each row coming from your Table input step you will issue an SQL query for every lookup step.
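A sketch of what that means in SQL terms (names are illustrative): a Database lookup step effectively runs a parameterized single-row query for every row of the stream, whereas the engine can resolve the same relation with one set-based join.

```sql
-- What a Database lookup step does, conceptually: executed once per incoming
-- row, with the key bound as a JDBC-style parameter ('?').
SELECT ProductKey
FROM dbo.DimProduct
WHERE ProductName = ?;   -- key taken from the current row of the stream

-- What the database engine can do instead: one set-based join it can optimize
-- with indexes and table statistics.
SELECT src.*, p.ProductKey
FROM dbo.DenormalizedSales AS src
JOIN dbo.DimProduct        AS p ON p.ProductName = src.ProductName;
```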
Upvotes: 1