Reputation: 4490
Given this example:
describe A;
A: {ht.udf.cleanlog_log_5: (ip: chararray,property_id: int)}
My understanding is that A is a bag made of tuples of type ht.udf.cleanlog_log_5 (correct?).
When I apply this transformation:
B = FOREACH A GENERATE FLATTEN($0);
describe B;
B: {ht.udf.cleanlog_log_7::ip: chararray,ht.udf.cleanlog_log_7::property_id: int}
What is B?
Is it a bag of unnamed tuples, where each tuple has two named fields (i.e. ht.udf.cleanlog_log_7::ip and ht.udf.cleanlog_log_7::property_id)?
Thank you
Upvotes: 2
Views: 238
Reputation: 5801
Your understanding is almost correct. The issue is that when you use DESCRIBE, Pig doesn't explicitly show the outer tuple that wraps each entire record.
As you said, relations are bags of tuples. When you read a file, for example, the bag encompasses all the data in the file and each line of the file is a tuple. In your case, A is a bag of tuples. Each tuple in the bag has one element, ht.udf.cleanlog_log_5. This element is also a tuple, with two elements, ip and property_id.
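To make the nesting concrete, here is a minimal sketch in Pig Latin. The file name logs.txt and the alias rec are hypothetical stand-ins (the alias in the question comes from a UDF), and each input line is assumed to hold one tuple in Pig's text form, e.g. (1.2.3.4,10).

```pig
-- Hypothetical input file; each line is assumed to look like: (1.2.3.4,10)
A = LOAD 'logs.txt' AS (rec: (ip: chararray, property_id: int));

-- Each record of A is an unnamed outer tuple with a single field, rec.
-- rec is itself a tuple, so its elements are reached with the dot operator.
ips = FOREACH A GENERATE rec.ip;
DESCRIBE ips;
```

Note that the projection goes through rec; there is no name for the outer tuple itself.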
Now, when you use FLATTEN on a tuple, it "promotes" the elements of that tuple to be elements of the containing tuple. So B is a bag of tuples. Each tuple of the bag has two elements, ht.udf.cleanlog_log_7::ip and ht.udf.cleanlog_log_7::property_id.
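As a rough illustration of that promotion, the sketch below flattens a nested tuple and then refers to the promoted fields. The file logs.txt and the alias rec are again hypothetical, with each input line assumed to look like (1.2.3.4,10).

```pig
-- Hypothetical input file; each line is assumed to look like: (1.2.3.4,10)
A = LOAD 'logs.txt' AS (rec: (ip: chararray, property_id: int));

-- FLATTEN lifts ip and property_id out of rec into the record itself.
B = FOREACH A GENERATE FLATTEN(rec);
DESCRIBE B;
-- should print something like:
-- B: {rec::ip: chararray,rec::property_id: int}

-- The promoted fields can be referenced by their prefixed name,
-- by the short name when it is unambiguous, or by position.
by_name = FOREACH B GENERATE rec::ip, property_id;
by_pos  = FOREACH B GENERATE $0, $1;
```

The rec:: prefix only records where the field came from; it disambiguates names and does not indicate any remaining nesting.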
A more accurate way of DESCRIBEing a relation's schema would be to show this outer tuple, like this:
describe A;
A: {(ht.udf.cleanlog_log_5: (ip: chararray,property_id: int))}
describe B;
B: {(ht.udf.cleanlog_log_7::ip: chararray,ht.udf.cleanlog_log_7::property_id: int)}
But these tuples are never named and can never be referred to, so there is no use in showing them.
Upvotes: 3