Reputation: 1197
I am new to Pig and still exploring efficient ways to do simple things. For example, I have a bag of events:
{"events":[{"event": ev1}, {"event": ev2}, {"event":ev3}, ....]}
I want to collapse that into just a tuple, something like:
{"events":[ev1, ev2, ev3, ....]}
Is there a way to achieve this in Pig? I have been struggling with this for a while, without much success :(.
Thanks in advance
Upvotes: 0
Views: 1085
Reputation: 1197
Thanks, all, for the informative comments. They helped me.
However, I found I was missing an important feature of a schema, namely that every field has a key and a value (a map). I now achieve what I wanted by writing a UDF that converts the bag to a comma-separated string of values:
    package BagCondenser;

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    public class BagToStringOfCommaSeparatedSegments extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {
            // Nothing to condense if the input tuple is missing or empty
            if (input == null || input.size() == 0)
                return null;

            Object inputObject = input.get(0);
            if (inputObject == null)
                return null;

            // Throw an error if the input is not a bag
            if (!(inputObject instanceof DataBag))
                throw new IOException("Expected input to be a bag, but got: "
                        + inputObject.getClass());

            // The input bag
            DataBag bag = (DataBag) inputObject;

            // Comma-separated string to be returned
            StringBuilder listOfSegmentIds = new StringBuilder();
            boolean first = true;

            // Collect the first field of each tuple in the bag
            for (Tuple tuple : bag) {
                // Append a ',' before every value except the first
                if (!first)
                    listOfSegmentIds.append(",");
                first = false;
                listOfSegmentIds.append(tuple.get(0).toString());
            }
            return listOfSegmentIds.toString();
        }
    }
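If you want to try the joining logic without a Pig installation, here is a minimal sketch in plain Java. It assumes ordinary lists stand in for Pig's DataBag and Tuple; the class name BagJoinSketch and the method joinFirstFields are made up for illustration and are not part of Pig. It uses StringJoiner so the separator handling does not depend on whether the string built so far is empty, which matters when the first value is itself an empty string.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical stand-in for the UDF: plain lists replace Pig's DataBag and Tuple.
final class BagJoinSketch {

    // Joins the first field of every "tuple" with commas.
    static String joinFirstFields(List<List<Object>> bag) {
        StringJoiner joined = new StringJoiner(",");
        for (List<Object> tuple : bag) {
            // Skip empty tuples and null first fields instead of failing
            if (tuple == null || tuple.isEmpty() || tuple.get(0) == null) {
                continue;
            }
            joined.add(tuple.get(0).toString());
        }
        return joined.toString();
    }
}
```

Calling joinFirstFields on a list holding the one-element lists ["ev1"], ["ev2"] and ["ev3"] gives the string ev1,ev2,ev3. Skipping null fields is a choice made for this sketch; the UDF above would throw on them instead.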
Upvotes: 0
Reputation: 5184
Looking at your input, it seems that your schema is something like:
A: {name:chararray, vals:{(inner_name:chararray, inner_value:chararray)}}
As I mentioned in a comment to your question, actually turning this into an array of nothing but inner_values will be extremely difficult, since you don't know how many fields you could potentially have. When you don't know the number of fields, you should always try to use a bag in Pig.
Luckily, if you can in fact use a bag for this, it is trivial:
-- Project out only inner_value from the bag vals
B = FOREACH A GENERATE name, vals.inner_value;
Upvotes: 0