Reputation: 307
I'm doing some transformations on a data set and need to publish it in a sane-looking format. Currently my final set looks like this when I run describe:
{memberId: long,companyIds: {(subsidiary: long)}}
I need it to look like this:
{memberId: long,companyIds: [long] }
where companyIds is the key to an array of ids of type long.
I'm really struggling with how to manipulate things in this way. Any ideas? I've tried using FLATTEN and other commands to no avail. I'm using AvroStorage to write the files, and the field schema I need to write this data to looks like this:
"fields": [
{ "name": "memberId", "type": "long"},
{ "name": "companyIds", "type": {"type": "array", "items": "int"}}
]
Upvotes: 1
Views: 2562
Reputation: 26
I know this is a bit old, but I recently ran into the same problem.
Based on the AvroStorage documentation, with the latest versions of Pig and AvroStorage it is possible to store a bag directly as an Avro array.
In your case, you may want something like:
STORE blah INTO 'blah' USING AvroStorage('schema','{your schema}');
where the array field in the schema is
{
"name":"companyIds",
"type":[
"null",
{
"type":"array",
"items":"long"
}
],
"doc":"company ids"
}
Upvotes: 1
Reputation: 1855
There is no array type in Pig (http://pig.apache.org/docs/r0.10.0/basic.html#data-types). However, if all you need is good-looking output and you don't have too many elements in companyIds, you may want to write a simple UDF that converts the bag into a nicely formatted string.
Java code
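import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Assumes Apache Commons Lang is on the classpath for StringUtils.
import org.apache.commons.lang.StringUtils;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Joins the first field of every tuple in a bag into one ':'-separated string.
public class BagToString extends EvalFunc<String>
{
    @Override
    public String exec(Tuple input) throws IOException
    {
        // Guard against a missing or null bag before casting.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag bag = (DataBag) input.get(0);
        if (bag.size() == 0) {
            return null;
        }
        List<String> strings = new ArrayList<String>();
        for (Tuple t : bag) {
            // String.valueOf avoids a NullPointerException on a null field.
            strings.add(String.valueOf(t.get(0)));
        }
        return StringUtils.join(strings, ":");
    }
}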
PIG script
foo = foreach bar generate memberId, BagToString(companyIds);
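To get a feel for what the UDF's output looks like without running Pig, the joining step can be sketched in plain Java. This is only an illustration: the class name BagJoinSketch, the join method, and the sample ids are made up for the example, and a List<Long> stands in for the companyIds bag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UDF's joining logic; no Pig classes are involved.
class BagJoinSketch {

    // Joins the ids with ':' and returns null for a null or empty list,
    // mirroring how the UDF returns null for an empty bag.
    static String join(List<Long> ids) {
        if (ids == null || ids.isEmpty()) {
            return null;
        }
        List<String> parts = new ArrayList<>();
        for (Long id : ids) {
            parts.add(String.valueOf(id));
        }
        return String.join(":", parts);
    }

    public static void main(String[] args) {
        // Sample ids are invented for illustration.
        System.out.println(join(Arrays.asList(101L, 202L, 303L))); // prints 101:202:303
    }
}
```

So a member with three company ids would end up with a single string field such as 101:202:303 rather than a nested bag.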
Upvotes: 2