McLovin

Reputation: 1601

Talend - Transposing Input File with Dynamic Number of Columns

I have a very tricky situation and have yet to come up with my own solution using Talend. I have an input csv file that has a dynamic number of columns and need to transpose this file to create a desired output file. The input file looks like this:

ID1 | Units1 | Count | Val1a | Val2a | Val3a | Val1b | Val2b | Val3b | Val1c | Val2c | Val3c
ID2 | Units2 | Count | Val1d | Val2d | Val3d | Val1e | Val2e | Val3e

As you can see, the Vals come in triplets on each row and I need each triplet to stay together. The Count column holds the number of triplets found on that row, as each row is dynamic. The final output I need is:

ID1 | Units1 | Val1a | Val2a | Val3a
ID1 | Units1 | Val1b | Val2b | Val3b
ID1 | Units1 | Val1c | Val2c | Val3c
ID2 | Units2 | Val1d | Val2d | Val3d 
ID2 | Units2 | Val1e | Val2e | Val3e

I have tried pivoting and unpivoting the rows but this does not allow me to keep the triplets together. Is it possible for me to send the input file through some Java code where I parse the file and output a csv file in this format?

Any help is much appreciated.

Upvotes: 1

Views: 3964

Answers (1)

ydaetskcoR

Reputation: 56877

You should be able to write a custom bit of Java code in a tJavaRow component that parses out the values as you need them.

You'll probably want to read the original source in as a raw input using tFileInputRaw so you can then parse the entire thing in your tJavaRow component. You'll want to add an output schema to the tJavaRow that has columns for your id, unit, value1, value2 and value3.

Because each input line needs to be broken down into multiple output lines, you are probably best off concatenating the values (such as val1a, val1b and val1c) into a single delimited string per column, which you can then split back into separate rows using the tNormalize component.

Off the top of my head the code in your tJavaRow component could look something like:

// First we split the data on the delimiter of the file, assuming comma separated for now
String[] splitData = input_row.content.split(",");

// Initialise the id, unit and the three value strings
String id = "";
String unit = "";
String value1 = "";
String value2 = "";
String value3 = "";

for (int i = 0; i < splitData.length; i++) {
    if (i == 0) {
        // id field is the first field
        id = splitData[i];
    } else if (i == 1) {
        // unit field is the second field
        unit = splitData[i];
    } else if (i == 2) {
        // Don't need to do anything with the count data
    } else if (i % 3 == 0) {
        // value1 fields are at index 3, 6, 9 etc so modulo 3 == 0
        if (value1.isEmpty()) {
            value1 = splitData[i];
        } else {
            // if the value string isn't empty (ie. from val1b onwards) then we need to add a delimiter for further processing
            value1 = value1 + "|" + splitData[i];
        }
    } else if (i % 3 == 1) {
        // value2 fields are at index 4, 7, 10 etc so modulo 3 == 1
        if (value2.isEmpty()) {
            value2 = splitData[i];
        } else {
            value2 = value2 + "|" + splitData[i];
        }
    } else {
        // value3 fields are at index 5, 8, 11 etc so modulo 3 == 2
        if (value3.isEmpty()) {
            value3 = splitData[i];
        } else {
            value3 = value3 + "|" + splitData[i];
        }
    }
}

// Finally we assign these strings to their output schema columns
output_row.id = id;
output_row.unit = unit;
output_row.value1 = value1;
output_row.value2 = value2;
output_row.value3 = value3;

You'll have some issues with this if the contents of any of your fields can contain commas (or whatever delimiter your file uses), as you will then need a more complicated parsing method that allows for quoted field data.
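If you do need to cope with quoted fields, a minimal sketch (assuming fields are wrapped in double quotes and contain no escaped quotes, and reusing the input_row.content column from above) would be to split only on commas that sit outside quotes and then strip the quotes afterwards:

// Split on commas that are followed by an even number of quotes (ie. commas outside quoted fields)
String[] splitData = input_row.content.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

// Strip any surrounding quotes from each field after splitting
for (int i = 0; i < splitData.length; i++) {
    splitData[i] = splitData[i].replaceAll("^\"|\"$", "");
}

A proper CSV parsing library would be more robust than this regex, but it keeps the example self-contained within the tJavaRow code.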

From here you just need to connect the tNormalize component and have it split on the | character (or whatever character you chose instead; it should be one that will never appear in your source data). You might need to do this three times to get your desired output (as you have three columns to normalise on).
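As a sanity check on what that normalisation step should produce, here is a short plain-Java sketch (not tied to any particular Talend component, and assuming the id, unit, value1, value2 and value3 strings built above) that splits the three pipe-delimited columns in lock step and prints one line per triplet:

// Split each concatenated column on the | delimiter (escaped because | is a regex metacharacter)
String[] v1 = value1.split("\\|");
String[] v2 = value2.split("\\|");
String[] v3 = value3.split("\\|");

// The three arrays are the same length, so index i always holds one original triplet
for (int i = 0; i < v1.length; i++) {
    System.out.println(id + "," + unit + "," + v1[i] + "," + v2[i] + "," + v3[i]);
}

The key point is that the triplets stay aligned because value1, value2 and value3 were built in the same left-to-right order from the original row.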

Upvotes: 1
