PIG UDF to convert tuple to multiple tuple output

Question

I am new to Pig and am trying to write a UDF that takes a tuple and returns multiple tuples based on a delimiter. I have written a UDF to read the data file below:

2012/01/01 Name1 Category1|Category2|Category3
2012/01/01 Name2 Category2|Category3
2012/01/01 Name3 Category1|Category5

Basically, I am trying to read the $2 field:

Category1|Category2|Category3
Category2|Category3
Category1|Category5

and get this output:

Category1, Category2, Category3
Category2, Category3
Category1, Category5

Below is the UDF code I have written:

    package com.test.multipleTuple;    
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class TupleToMultipleTuple extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {

            // Collects the individual values split out of the input field
            Tuple aux = TupleFactory.getInstance().newTuple();

            if (input == null || input.size() == 0)
                return null;
            try {
                String del = "\\|";
                String str = (String) input.get(0);

                String field[] = str.split(del);
                for (String nxt : field) {
                    aux.append(nxt.trim());
                }
            } catch (Exception e) {
                throw new IOException("Caught exception processing input row ", e);
            }

            return aux.toDelimitedString(",");
        }
    }

I packaged it as TupleToMultipleTuple.jar.

But I get the error below when I run it:

    Pig Stack Trace
    ---------------
    ERROR 1066: Unable to open iterator for alias B

    org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias B
        at org.apache.pig.PigServer.openIterator(PigServer.java:892)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:774)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:547)
        at org.apache.pig.Main.main(Main.java:158)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    Caused by: java.io.IOException: Job terminated with anomalous status FAILED
        at org.apache.pig.PigServer.openIterator(PigServer.java:884)
        ... 13 more

Can you please help me rectify the issue? Thanks.

The Pig script that applies the UDF:

    REGISTER TupleToMultipleTuple.jar;
    DEFINE myFunc com.test.multipleTuple.TupleToMultipleTuple();
    A = load 'data.txt' USING PigStorage(' ');
    B = foreach A generate myFunc($2);
    dump B;

54l3d · Accepted Answer

You can use the built-in STRSPLIT function instead of a UDF, like this:

    flatten(STRSPLIT($2, '[|]', 3)) as (cat1:chararray, cat2:chararray, cat3:chararray)

This gives you three fields named cat1, cat2 and cat3, each typed as chararray. When the relation is dumped or stored, they are separated by that relation's own field delimiter.
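
To see what that split does to each row without running a Pig job, here is a minimal plain-Java sketch. It assumes STRSPLIT behaves like Java's String.split(regex, limit), which is my understanding of the built-in but worth checking against your Pig version. The class and method names (SplitSketch, splitCategories) are made up for the illustration.

```java
import java.util.Arrays;

public class SplitSketch {

    // Hypothetical helper that mimics STRSPLIT($2, '[|]', 3) with String.split.
    // '[|]' is a character class, so the pipe is matched literally and needs no backslash.
    public static String[] splitCategories(String field) {
        return field.split("[|]", 3);
    }

    public static void main(String[] args) {
        // The $2 values from the sample data in the question
        String[] rows = {
            "Category1|Category2|Category3",
            "Category2|Category3",
            "Category1|Category5"
        };
        for (String row : rows) {
            // Join the pieces the way the question's expected output shows them
            System.out.println(String.join(", ", splitCategories(row)));
        }
    }
}
```

Two things to keep in mind with a limit of 3: a row with only two categories yields two pieces, so cat3 has nothing to bind to for that row, and a row with more than three categories keeps the remainder unsplit in the last piece (for example "A|B|C|D" gives "A", "B" and "C|D").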
