Reputation: 11
I do the following:
a = load '/hive/warehouse/' USING PigStorage('^') as (a1,b1,c1);
b = group a by (a1) ;
c = foreach b generate group, a.$2;
dump c;
Output shows all the groups:
abc {(1),(44),(66)}
cde {(1),(44),(66)}
How can I remove the "{" and "(" characters so that the final HDFS file can be read as a comma-delimited file?
Upvotes: 1
Views: 4153
Reputation: 11
This functionality is now provided in Pig as a built-in func (I'm using 0.11).
http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToString.html
c = foreach b generate group, a.$2 as stuff;
d = foreach c generate group, BagToString(stuff, ',');
I don't need a comma-delimited file for my use case, but I assume you can use a store func to get the final comma (between group and the now-comma-delimited-list of bag things).
Upvotes: 1
Reputation: 5811
You can't do this directly in Pig. The special syntax is required because you are storing a bag, and in order for Pig to be able to read this bag later, it needs to be stored with braces (for the bag) and parentheses (for the tuples contained in the bag).
You have a couple of options. You can read the file back into Pig, but instead of reading it as a bag, read it as a chararray. Then you can perform regex substitution to get rid of the punctuation (untested):
a = LOAD 'output' AS (grp:chararray, list:chararray);
b = FOREACH a GENERATE grp, REPLACE(list, '[{()}]', '');
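To check what that pattern does before running it on a cluster, here is a minimal Java sketch. As far as I know Pig's REPLACE takes a Java regular expression, so plain String.replaceAll should behave the same way; the class and method names (BagPunctuation, strip) are made up for illustration.

```java
// Sketch only: shows the effect of the character class [{()}] on the
// text form of a bag. BagPunctuation and strip are hypothetical names.
final class BagPunctuation {

    // Removes every '{', '}', '(' and ')' from the given text.
    static String strip(String bagText) {
        return bagText.replaceAll("[{()}]", "");
    }

    public static void main(String[] args) {
        // Sample bag taken from the question's output.
        System.out.println(strip("{(1),(44),(66)}")); // prints 1,44,66
    }
}
```

Note that the pattern removes these characters everywhere, so a value that itself contains a brace or parenthesis would be altered as well.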
Another option is to write a UDF which will turn a bag into a tuple. Note that this is not a well-defined operation: bags have no particular order, so from one run to the next, your tuple is not guaranteed to be in the same order. But for your purposes it sounds like that may not matter. The UDF could look like (very rough draft, untested):
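import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class BAG_TO_TUPLE extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
        // The first field of the input is expected to be the bag to convert.
        DataBag bag = (DataBag) input.get(0);
        Tuple out = TupleFactory.getInstance().newTuple();
        // Append the first field of each tuple in the bag to the output tuple.
        for (Tuple t : bag) {
            out.append(t.get(0));
        }
        return out;
    }
}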
The above UDF is terrible -- it assumes that you have exactly one element in every tuple of the bag (that you care about) and does no checking whatsoever that the input is valid, etc. But it should get you towards what you want.
The best solution, though, is to find a way to handle the extra punctuation outside of Pig if Pig is not part of your downstream processing.
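As a rough idea of what handling it outside of Pig could look like, here is a small Java filter. It assumes the output was stored with the default tab delimiter between the group and the bag (so a line looks like abc<TAB>{(1),(44),(66)}) and that none of the values contain braces, parentheses or tabs themselves. PigOutputToCsv and toCsvLine are hypothetical names, not part of Pig.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Sketch only: turns stored Pig output lines into comma-delimited lines.
// Assumes a tab between the group and the bag; names are hypothetical.
final class PigOutputToCsv {

    // Strips bag/tuple punctuation, then swaps the field tab for a comma.
    static String toCsvLine(String line) {
        return line.replaceAll("[{()}]", "").replace('\t', ',');
    }

    public static void main(String[] args) throws IOException {
        // Read lines from standard input and write the converted lines out.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(toCsvLine(line));
        }
    }
}
```

Something like piping the result of hadoop fs -cat on the output files through this program should then give one comma-delimited line per group.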
Upvotes: 3
Reputation: 452
Try the FLATTEN operator:
c = foreach b generate group, FLATTEN(a.$2);
Upvotes: 0