John Sylvester
John Sylvester

Reputation: 257

How to split a CSV file into groups using Pentaho?

I am new to Pentaho and am trying to read a CSV file (which I already did) and create blocks of data based on an identifier.

Eg

1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
1|N|O|P
4|Q|R|S|T
5|U|V|W

I need to split and group this as such:
(each block starts when the first column is equal to '1')

Block a)

1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M

Block b)

1|N|O|P
4|Q|R|S|T
5|U|V|W

Eg

 a |1|A|B|C
 a |2|D|E|F
 a |8|G|H|I|J|K
 a |4|L|M

 b |1|N|O|P
 b |4|Q|R|S|T
 b |5|U|V|W

How can this be achieved using Penatho? Thanks.

I found a similar question but answers don't really help my case Pentaho Kettle split CSV into multiple records

Upvotes: 0

Views: 2696

Answers (1)

MrMauricioLeite
MrMauricioLeite

Reputation: 383

I think I got the answer.

I created the transformation in this zip that can transform your "csv" file in rows almost like you described but I don't know what you intend to do next, so maybe you can give us more details. =)

I'll explain what I did:

1) First, we grab the row full text with a Text input step

Transformation

When you look at configurations of Text Input step, you'll see I used a ';' has separator, when your input file uses '|' so I'm not spliting columns with the '|' but loading the whole line in one column. Grabbing the row's full text, nothing else.

2) Next we apply a regex eval to separate the ID from the rest of our string.

^(\d+)\|(.*)

Which means: in the beginning of the text I expect one or more digits followed by a pipe and anything after that. Capture the digits in the beginning of the string in one column and everything after the pipe to another column.

That gives you this output: (blue is the first capture group, red is the second) regex

3) Now what you need is to add a 'sequence' that only goes up if there is a row_id = 1. Which I did in the Mod JS Value with the following code:

var sequence

//if it's the first row, set sequence to 1
if(sequence == null){
    sequence = 1;
}else{
//if it's not the first row, check if the row_id is equal to 1 (string)
    if(row_id == '1'){
        // increment the sequence
        sequence++;
    }else{
        //nothing
    }
}

And that will give you this output that seem to be what you expected: (green, the group/sequence done)

sequence

Hope it helps =)

Upvotes: 4

Related Questions