Newtang

Reputation: 6544

In Pig, flattening a bag into a single line

In my Pig script (0.9.2), my final output looks like this:

final: {email: chararray,{(name: chararray,percent: double)}}

where for every email address I have up to 3 names and scores. So, the output would look something like this:

[email protected] {(Joe Smith, 0.5),(Joseph, 0.1), (Joey, 0.1)}

What I'd really like to do is flatten this into tab-delimited values (no parentheses or curly braces) to make it easier to load into a MySQL table, like this:

[email protected] Joe Smith 0.5 Joseph 0.1 Joey 0.1

How can I accomplish this in Pig? Or do I have to write a custom UDF?

Upvotes: 2

Views: 3011

Answers (2)

Newtang

Reputation: 6544

I wrote a Java UDF that works pretty well for a bag of tuples. Tuple.toDelimitedString is the key: it joins a tuple's fields with the delimiter you pass in.

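import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class BagToString extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Pig UDFs conventionally return null for null or empty input.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag bag = (DataBag) input.get(0);
        Iterator<Tuple> bagIT = bag.iterator();
        String delimiter = "\t";

        StringBuilder sb = new StringBuilder();
        while (bagIT.hasNext()) {
            Tuple tupleInBag = bagIT.next();
            // Join this tuple's fields with the delimiter.
            sb.append(tupleInBag.toDelimitedString(delimiter));
            // Separate tuples, but do not leave a trailing delimiter.
            if (bagIT.hasNext()) {
                sb.append(delimiter);
            }
        }

        return sb.toString();
    }
}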
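
To check the expected string outside Pig, the same join can be sketched in plain Python. This is only an illustration: bag_to_string and sample_bag are hypothetical names, the bag is modeled as a list of tuples, and the sketch joins the fields without a trailing delimiter.

```python
def bag_to_string(bag, delimiter="\t"):
    """Join every field of every tuple in the bag into one delimited string."""
    # Walk the tuples in order and convert each field to text.
    fields = [str(item) for tup in bag for item in tup]
    # str.join puts the delimiter between fields only, never at the end.
    return delimiter.join(fields)


# Sample bag taken from the question: up to three (name, percent) tuples.
sample_bag = [("Joe Smith", 0.5), ("Joseph", 0.1), ("Joey", 0.1)]
line = bag_to_string(sample_bag)
print(line)
```

The printed line has six tab-separated fields, which is the shape the question asks for after the email column.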

Upvotes: 3

Eli

Reputation: 38949

You'll need to write a custom UDF for this, which is easy to do in a language like Python. Something like:

@outputSchema("flat_bag:bag{}")
def flattenBag(bag):
    flat_bag = [item for tup in bag for item in tup]
    return flat_bag

Just throw that into a .py file and load it like:

REGISTER '/path/to/udfs.py' using jython as py_funcs;

And then use it like:

final1 = FOREACH final GENERATE email, py_funcs.flattenBag($1);
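
To see what the comprehension inside flattenBag produces, here is a plain-Python sketch that runs without Pig. It assumes the bag arrives as a list of tuples; the outputSchema decorator is left out because Pig's Jython runtime supplies it, and flatten_bag and sample_bag are hypothetical names used only for this illustration. How Pig converts the returned list back into its own types is not shown here.

```python
def flatten_bag(bag):
    """Return all fields of all tuples in the bag as one flat list."""
    # The outer loop walks the tuples, the inner loop walks each tuple's fields.
    return [item for tup in bag for item in tup]


# Sample bag taken from the question.
sample_bag = [("Joe Smith", 0.5), ("Joseph", 0.1), ("Joey", 0.1)]
flat = flatten_bag(sample_bag)
print(flat)
```

The field order follows the bag's iteration order, so each name stays next to its percent.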

Upvotes: 5
