Edmondo
Edmondo

Reputation: 20090

Processing a single group of rows at once in Talend Open Studio for Data Integration

I have a data source where each row has five fields:

company name; year; code; value;

In my target output row model I want to produce a row like so

company name;year;value1;value2;value3;value4

Where value1,value..N are not concatenation for a single code but rather a "mapping". I.e. code 50 => "Total Revenues"

So I need to perform the following logic:

  1. First select all the records for the same company name / year
  2. Then apply some custom java logic that performs the mapping between my codes and my fields of the output row.

This is a in-memory map reduce with about 1M rows. How should this be handled in Talend Open Studio for Data Integration?

Upvotes: 2

Views: 2455

Answers (3)

Kriegel
Kriegel

Reputation: 433

You could do it like that, and it's essentially the approach Maira Bay already suggested:

  1. Set up your data source to emit those lines one at a time. I used tFixedFlowInput for that. You'd probably have to read from a file.
  2. Optionally sort by company name and year with a tSortRow.
  3. Map with a tMap the value of each line to the corresponding column in the result line with a guard clause like input.code.equals("code for this column") ? input.value : null.
  4. Aggregate the rows with a tAggregateRow, grouping by company name and year, selecting the first value for each of the value rows - but make sure to ignore the nulls.
  5. Do anything you want with the resulting lines.

I tried that with some sample data, hence the tFixedFlowInput in step 1, and it worked for me on my machine in TOS 6.3.1.

Beware: the solution proposed assumes you only got one value per combination of company name, year and code.

Upvotes: 2

Maira Bay
Maira Bay

Reputation: 300

select all the records for the same company name / year

You might want to use tAggregate (https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/18.1+tAggregateRow) to group the flow by company name and year

apply some custom java logic that performs the mapping between my codes and my fields of the output row.

Talend has a component called tMap that allows you to map input fields into output fields.

In your tMap you can use something like:

(assuming that input is the name of the flow into your tMap and output is the name of your flow out of your tMap)

In output.field1 put input.code == 50? input.value : 0

In output.field2 put input.code == 60? input.value : 0

In output.field2 put input.code == 70? input.value : 0

etc

This is assuming you are ok with leaving the field columns with 0 if the value was for another code.

If you want the value for each code to be in a different output row out of the tMap you can use a logic similar to the above, only putting each test (code == 70? input.value : 0) in a different output table, and then filtering out the rows that have 0 (using a tFilter) after the tMap.

To add output tables you can use the + symbol on the top right of the tMap.

See here for more detials on how to use tMap: https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide54EN/tMap

I hope this helps!

Upvotes: 2

E LaRoche
E LaRoche

Reputation: 1144

See solution below which I believe will fulfill your precise requirement of taking a delimited file data source and transforming it into a denormalized out as specified above.

First I mocked up a file with the same format as you specified. I made the values a logical concatenation of Company, Year, and sequence. This makes it easy to verify the output.

enter image description here

Next I use that as an input, run it thru a sorter, then denormalize on the value field. Finally you can see the output in a tLogRow.

enter image description here

I also included the component view of tDenormalize so you can see how that is done. You can use this technique in any falvor of Talend Open Studio.
enter image description here

Upvotes: 0

Related Questions