Nikhil J Joshi

Reputation: 1197

Pig: Bag to Tuple

I am new to Pig and still exploring efficient ways to do simple things. For example, I have a bag of events

{"events":[{"event": ev1}, {"event": ev2}, {"event":ev3}, ....]}

and I want to collapse it into just a tuple, something like

{"events":[ev1, ev2, ev3, ....]}

Is there a way to achieve this in Pig? I have been struggling with this for a while, without much success :(.

Thanks in advance

Upvotes: 0

Views: 1085

Answers (2)

Nikhil J Joshi

Reputation: 1197

Thanks all for the informative comments. They helped me.

However, I found I had overlooked an important feature of a schema: every field has a key and a value (a map). I now get what I wanted with a UDF that converts the bag to a comma-separated string of values:

 package BagCondenser;

 import java.io.IOException;

 import org.apache.pig.EvalFunc;
 import org.apache.pig.data.DataBag;
 import org.apache.pig.data.Tuple;


 public class BagToStringOfCommaSeparatedSegments
     extends EvalFunc<String> {

     @Override
     public String exec(Tuple input) throws IOException {

         // Return null when there is nothing to condense
         if (input == null || input.size() == 0 || input.get(0) == null)
             return null;

         Object inputObject = input.get(0);

         // Throw an error if the argument is not a bag
         if (!(inputObject instanceof DataBag))
             throw new IOException("Expected input to be a bag, but got: "
                 + inputObject.getClass());

         // The input bag
         DataBag bag = (DataBag) inputObject;

         // Condensed string to be returned
         StringBuilder listOfSegmentIds = new StringBuilder();

         // Append the first field of each tuple in the bag
         for (Tuple tuple : bag) {
             // Skip empty tuples and null values instead of failing on them
             if (tuple == null || tuple.size() == 0 || tuple.get(0) == null)
                 continue;

             // If the string already has values, append a ','
             if (listOfSegmentIds.length() > 0)
                 listOfSegmentIds.append(",");

             listOfSegmentIds.append(tuple.get(0).toString());
         }

         return listOfSegmentIds.toString();
     }

 }
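
The joining step of this UDF can be tried without a Pig installation. The following is only a minimal sketch, not the UDF itself: plain Java lists stand in for Pig's DataBag and Tuple, and the class name BagJoinSketch and the method joinFirstFields are hypothetical names made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class BagJoinSketch {

    // Joins the first field of every "tuple" with commas.
    // Assumption: a List<List<Object>> stands in for a Pig bag of tuples.
    public static String joinFirstFields(List<List<Object>> bag) {
        StringJoiner joiner = new StringJoiner(",");
        for (List<Object> tuple : bag) {
            // Skip empty tuples and null values instead of failing on them
            if (tuple == null || tuple.isEmpty() || tuple.get(0) == null) {
                continue;
            }
            joiner.add(tuple.get(0).toString());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Three single-field tuples, mirroring the events in the question
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList("ev1"),
            Arrays.<Object>asList("ev2"),
            Arrays.<Object>asList("ev3"));
        System.out.println(joinFirstFields(bag)); // prints ev1,ev2,ev3
    }
}
```

StringJoiner takes care of the separator, so no check for "is this the first value" is needed.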

Upvotes: 0

mr2ert

Reputation: 5184

Looking at your input it seems that your schema is something like:

A: {name:chararray, vals:{(inner_name:chararray, inner_value:chararray)}}

As I mentioned in a comment to your question, actually turning this into an array of nothing but inner_values will be extremely difficult since you don't know how many fields you could potentially have. When you don't know the number of fields you should always try to use a bag in Pig.

Luckily, if you can in fact use a bag for this, it is trivial:

-- Project out only inner_value from the bag vals
B = FOREACH A GENERATE name, vals.inner_value ;

Upvotes: 0
