Reputation: 257
I am new to Pentaho and am trying to read a CSV file (which I already did) and create blocks of data based on an identifier.
Eg
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
1|N|O|P
4|Q|R|S|T
5|U|V|W
I need to split and group this as such:
(each block starts when the first column is equal to '1')
Block a)
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
Block b)
1|N|O|P
4|Q|R|S|T
5|U|V|W
Eg
a |1|A|B|C
a |2|D|E|F
a |8|G|H|I|J|K
a |4|L|M
b |1|N|O|P
b |4|Q|R|S|T
b |5|U|V|W
How can this be achieved using Penatho? Thanks.
I found a similar question but answers don't really help my case Pentaho Kettle split CSV into multiple records
Upvotes: 0
Views: 2696
Reputation: 383
I think I got the answer.
I created the transformation in this zip that can transform your "csv" file in rows almost like you described but I don't know what you intend to do next, so maybe you can give us more details. =)
I'll explain what I did:
1) First, we grab the row full text with a Text input step
When you look at configurations of Text Input step, you'll see I used a ';' has separator, when your input file uses '|' so I'm not spliting columns with the '|' but loading the whole line in one column. Grabbing the row's full text, nothing else.
2) Next we apply a regex eval to separate the ID from the rest of our string.
^(\d+)\|(.*)
Which means: in the beginning of the text I expect one or more digits followed by a pipe and anything after that. Capture the digits in the beginning of the string in one column and everything after the pipe to another column.
That gives you this output: (blue is the first capture group, red is the second)
3) Now what you need is to add a 'sequence' that only goes up if there is a row_id = 1. Which I did in the Mod JS Value with the following code:
var sequence
//if it's the first row, set sequence to 1
if(sequence == null){
sequence = 1;
}else{
//if it's not the first row, check if the row_id is equal to 1 (string)
if(row_id == '1'){
// increment the sequence
sequence++;
}else{
//nothing
}
}
And that will give you this output that seem to be what you expected: (green, the group/sequence done)
Hope it helps =)
Upvotes: 4