Alexandre Tranchant

Reputation: 4571

Talend: Equivalent of logstash "key value" filter

I'm discovering Talend Open Studio for Data Integration and I would like to transform my data file into a CSV file.

My data consists of sets of key=value pairs, like this example:

A=0 B=3 C=4
A=2 C=4
A=2 B=4
A= B=3 C=1

I want to transform it into a CSV like this one:

A,B,C
0,3,4
2,,4
2,4,

With Logstash, I was using the "key value" filter, which does this job in a few lines of code. But with Talend, I can't find a similar transformation. I tried a "delimited file" job and some other jobs without success.
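For reference, the transformation being asked for can be sketched in a few lines of plain Python (a standalone illustration of the desired logic, not a Talend job; the `kv_to_csv` helper name is hypothetical):

```python
import csv
import io

def kv_to_csv(lines):
    # Parse each line into a dict of key=value pairs, remembering
    # column order from first appearance of each key.
    rows = []
    keys = []
    for line in lines:
        row = {}
        for pair in line.split():
            key, _, value = pair.partition("=")
            row[key] = value
            if key not in keys:
                keys.append(key)
        rows.append(row)
    # Write a CSV with one column per key; missing keys become empty cells.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keys, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

print(kv_to_csv(["A=0 B=3 C=4", "A=2 C=4", "A=2 B=4"]))
```

Running this on the sample input above prints the expected header row followed by `0,3,4`, `2,,4` and `2,4,`.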

Upvotes: 1

Views: 346

Answers (2)

Ibrahim Mezouar

Reputation: 4061

Corentin's answer is excellent, but here's an enhanced version of it, which cuts down on the number of components:

[screenshot: the simplified job design]

Instead of using tFileInputRaw and tConvertType, I used tFileInputFullRow, which reads the file line by line into a single string.
Instead of splitting the string manually (where you need to check for nulls), I used tExtractDelimitedFields with "=" as the separator, to extract a key and a value from each "key=value" column.
The end result is the same, with an extra column at the beginning.
If you want to delete that column, a dirty hack is to read the output file with a tFileInputFullRow, use a regex like ^[^;]+; in a tReplace to strip everything up to (and including) the first ";" on each line, and write the result to another file.
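The tReplace step in that dirty hack can be checked outside of Talend; a minimal sketch of the same regex, assuming ";" is the field separator and the invented IDs sit in the first column:

```python
import re

# Strip everything up to and including the first ";" on each line,
# mimicking the tReplace step with the pattern ^[^;]+;
lines = ["I1;0;3;4", "I2;2;;4"]
cleaned = [re.sub(r"^[^;]+;", "", line) for line in lines]
print(cleaned)  # the leading ID column is gone
```

Because `[^;]+` cannot cross a ";", only the first field and its trailing separator are removed; the empty cells in the rest of the row are preserved.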

Upvotes: 1

Corentin

Reputation: 2552

This is quite tricky and interesting: Talend is schema-based, so if you don't have the input/output schema predefined, it can be quite hard to achieve what you want.

Here is something you can try. It uses a bunch of components; I didn't manage to find a solution with fewer. It relies on unusual components like tNormalize and tPivotToColumnsDelimited, and it has one flaw: you'll get an extra column in the end.

[screenshot: the job design]

1 - tFileInputRaw : since you don't know your input schema, just read the whole file with this component.

2 - tConvertType : convert the Object output to String.

3 - tNormalize : split the content into lines manually (use \n as the separator).

4 - tMap : add a sequence "I"+Numeric.sequence("s1",1,1) ; it will be used later to identify and regroup lines.

5 - tNormalize : normalize on the 'TAB' separator, to get one row per key=value pair.

6 - tMap : split on the "=" sign to separate each key from its value.

[screenshot: the tMap configuration splitting on "="]

At this point, you'll have an output like:

|seq|key|value|
|---+---+-----|
|I1 |A  |1    |
|I1 |B  |2    |
|I1 |C  |3    |
|I2 |A  |2    |
|I2 |C  |4    |
|I3 |A  |2    |
|I3 |B  |4    |
'---+---+-----'

where seq is the line number.

7 - tPivotToColumnsDelimited : finally, with this component, you'll have the result. Unfortunately, you'll get the extra "ID" column, as the output schema provided by the component is not editable (the component actually creates the schema itself, which is very unusual amongst Talend components). Use the ID column as the regroup column.
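What the pivot step does can be illustrated with a short plain-Python sketch using the values from the table above (an illustration of the logic, not Talend code):

```python
# Pivot (seq, key, value) rows into one line per seq and one column per key,
# mimicking tPivotToColumnsDelimited with ID as the regroup column.
rows = [
    ("I1", "A", "1"), ("I1", "B", "2"), ("I1", "C", "3"),
    ("I2", "A", "2"), ("I2", "C", "4"),
    ("I3", "A", "2"), ("I3", "B", "4"),
]
keys = sorted({k for _, k, _ in rows})  # column order: A, B, C
pivoted = {}
for seq, key, value in rows:
    pivoted.setdefault(seq, {})[key] = value
for seq in pivoted:
    # Missing keys become empty cells, just like in the desired CSV.
    print(seq + ";" + ";".join(pivoted[seq].get(k, "") for k in keys))
```

This also makes the flaw visible: each output line starts with the regroup ID, which is the extra column that has to be stripped afterwards.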

[screenshot: the tPivotToColumnsDelimited configuration]

Hope this helps. Again, Talend is not a very easy tool when you have dynamic input/output schemas.

Upvotes: 2
