What is the difference between these two Pig data types?

Question

Given this example:

describe A;
A: {ht.udf.cleanlog_log_5: (ip: chararray,property_id: int)}

My understanding is that A is a bag made of tuples of type ht.udf.cleanlog_log_5.
(correct?)

When I apply this transformation:

B = FOREACH A GENERATE FLATTEN($0);
describe B;
B: {ht.udf.cleanlog_log_7::ip: chararray,ht.udf.cleanlog_log_7::property_id: int}

What is B?
Is it a bag with unnamed tuples?
Where each tuple has two named fields (i.e. ht.udf.cleanlog_log_7::ip and ht.udf.cleanlog_log_7::property_id)?

Thank you

reo katoa · Accepted Answer

Your understanding is almost correct. The catch is that DESCRIBE does not explicitly show the outer tuple that represents each whole record.

As you said, relations are bags of tuples. When you read a file, for example, the bag encompasses all the data in the file and each line of the file is a tuple. In your case, A is a bag of tuples. Each tuple in the bag has one element, ht.udf.cleanlog_log_5. This element is also a tuple, with two elements, ip and property_id.

Now, when you use FLATTEN on a tuple, it "promotes" the elements of that tuple to be elements of the containing tuple. So B is a bag of tuples. Each tuple of the bag has two elements, ht.udf.cleanlog_log_7::ip and ht.udf.cleanlog_log_7::property_id.

A more complete way for DESCRIBE to print a relation's schema would be to show this outer tuple, like this:

describe A;
A: {(ht.udf.cleanlog_log_5: (ip: chararray,property_id: int))}
describe B;
B: {(ht.udf.cleanlog_log_7::ip: chararray,ht.udf.cleanlog_log_7::property_id: int)}

But these tuples are never named and can never be referred to, so there is no use in showing them.
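To make the difference concrete, here is a minimal sketch of how you would reach the ip field before and after the FLATTEN. The input path 'cleanlog.txt' and the aliases rec, A_ips and B_ips are made up for illustration; your own relation uses the UDF-generated name ht.udf.cleanlog_log_5 instead of rec.

```pig
-- Hypothetical input file and alias names, for illustration only.
-- Each record holds a single tuple-valued element, the same shape as A in the question.
A = LOAD 'cleanlog.txt' AS (rec:tuple(ip:chararray, property_id:int));

-- In A the fields live inside the inner tuple, so project them with the dot operator.
A_ips = FOREACH A GENERATE rec.ip;

-- FLATTEN promotes the inner tuple's fields to top-level fields of each record.
B = FOREACH A GENERATE FLATTEN(rec);

-- In B the fields are top-level; the rec:: prefix only records where they came from.
B_ips = FOREACH B GENERATE rec::ip;

DESCRIBE A;
DESCRIBE B;
```

So the practical difference is the operator you use: a dot to look inside the nested tuple in A, and the :: prefix on an ordinary top-level field in B. When a field name is unambiguous in B, you can usually drop the prefix and write just ip.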
