Reputation: 1507
I've got two data sets (csv files) which I bring into a Pentaho DI transformation using a "CSV file input" step. File A has a column joincol
whose values appear in file B in column joincol
. I want to left join the data in file A with the data in file B on the condition A.joincol = B.joincol
to get the corresponding data from column wantcol
from file B.
Pretty simple stuff.
As a (necessary) preliminary step I sort both datasets on the columns to be joined on (namely A.joincol
and B.joincol
) ascending.
I then use a "Merge Join" step with the sorted data from file A as the "First Step" and the sorted data from file B as the "Second Step", and select the "Join Type" to be LEFT OUTER. I select "Get key fields" under the "Keys for 1st step" section and specify that joincol
is the only key field to join on, then do the same for "Keys for 2nd step".
So, the join partially works except:
joincol
when it is sorted ascending.I've confirmed that the values in file A's joincol
all appear in file B's joincol
by doing a vlookup in Excel. In any case, even if none of the values appeared in file B a left outer join should still leave me with at least all the rows I had going in.
I am new to Pentaho so I find it hard to believe that I stumbled upon such a crazy bug so early on in my exploration, which means I must be missing a step. However, I read the documentation for the merge join and believe I have followed all necessary steps (the ones I've laid out here). Can anyone let me know if I've missed something? Perhaps this is the result of a common error?
Thanks very much!
Upvotes: 0
Views: 1313
Reputation: 6356
Joins is typical of SQL not of PDI.
With PDI, there is a main stream of data, and for each row you look for a correspondence in the other stream and return the required columns.
If you look for one value, for example the Customer of an Order, use the Lookup/Lookup Stream
step. When you expect some values to be missing (LEFT JOIN), you can use a default. When you want only row with a correspondence (INNER JOIN), you can filter out the default you just put.
If you need to return many rows, for example all the Orders of a Customer, you may the Joins/Merge Join
. From experience, it is usually easier to reverse the question.
In case you cannot, you are in the case of a FULL OUTER JOIN, you have to use Joins/Join Rows (Cartesian Product)
, in which you can add one or many conditions.
Upvotes: 1