Rookatu
Rookatu

Reputation: 1507

Left Outer join drops rows in Pentaho DI

I've got two data sets (csv files) which I bring into a Pentaho DI transformation using a "CSV file input" step. File A has a column joincol whose values appear in file B in column joincol. I want to left join the data in file A with the data in file B on the condition A.joincol = B.joincol to get the corresponding data from column wantcol from file B.

Pretty simple stuff.

As a (necessary) preliminary step I sort both datasets on the columns to be joined on (namely A.joincol and B.joincol) ascending.

I then use a "Merge Join" step with the sorted data from file A as the "First Step" and the sorted data from file B as the "Second Step", and select the "Join Type" to be LEFT OUTER. I select "Get key fields" under the "Keys for 1st step" section and specify that joincol is the only key field to join on, then do the same for "Keys for 2nd step".

So, the join partially works except:

I've confirmed that the values in file A's joincol all appear in file B's joincol by doing a vlookup in Excel. In any case, even if none of the values appeared in file B a left outer join should still leave me with at least all the rows I had going in.

I am new to Pentaho so I find it hard to believe that I stumbled upon such a crazy bug so early on in my exploration, which means I must be missing a step. However, I read the documentation for the merge join and believe I have followed all necessary steps (the ones I've laid out here). Can anyone let me know if I've missed something? Perhaps this is the result of a common error?

Thanks very much!

Upvotes: 0

Views: 1313

Answers (1)

AlainD
AlainD

Reputation: 6356

Joins is typical of SQL not of PDI.

With PDI, there is a main stream of data, and for each row you look for a correspondence in the other stream and return the required columns.

If you look for one value, for example the Customer of an Order, use the Lookup/Lookup Stream step. When you expect some values to be missing (LEFT JOIN), you can use a default. When you want only row with a correspondence (INNER JOIN), you can filter out the default you just put.

If you need to return many rows, for example all the Orders of a Customer, you may the Joins/Merge Join. From experience, it is usually easier to reverse the question.

In case you cannot, you are in the case of a FULL OUTER JOIN, you have to use Joins/Join Rows (Cartesian Product), in which you can add one or many conditions.

Upvotes: 1

Related Questions