Reputation: 439
I'm filtering a table by a list of about 20 IDs. Right now my code looks like this:
A = LOAD 'ids.txt' USING PigStorage();
B = LOAD 'massive_table' USING PigStorage();
C = JOIN A BY $0, B BY $0;
D = FOREACH C GENERATE $1, $2, $3, $4, ...
STORE D INTO 'foo' USING PigStorage();
What I don't like is the line defining D, where I have to generate a new relation just to get rid of the joining column, explicitly listing every other column I want to keep (and sometimes that is a lot of columns). Is there something equivalent to:
FILTER B BY $0 IN (A)
or:
DROP $0 FROM C
Upvotes: 4
Views: 11053
Reputation: 21563
To drop the column at position $5 (the sixth column, since positions are zero-based), you could do it like so:
D = FOREACH C GENERATE .. $4, $6 .. ;
Dropping a column when you know only its name does not appear to be possible. It is possible, however, if you know the names of the columns directly before and after it. To drop the column(s) between colBeforeMyCol and colAfterMyCol, you could do it like so:
aliasAfter = FOREACH aliasBefore GENERATE
.. colBeforeMyCol, colAfterMyCol ..;
Upvotes: 3
Reputation: 30089
Maybe similar-ish to this question:
That references a JIRA ticket, https://issues.apache.org/jira/browse/PIG-1693, which shows how you can use the .. notation to denote all the remaining fields:
D = FOREACH C GENERATE $1 .. ;
This assumes you have Pig 0.9.0 or later.
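Put together with the script from the question, it might look something like the sketch below. It assumes ids.txt holds a single column, so that $0 of the joined relation is the ID from A and everything from $1 onward comes from B. The relations are loaded without a declared schema here, as in the question, so it is worth checking the result on your Pig version before relying on it.

```pig
A = LOAD 'ids.txt' USING PigStorage();
B = LOAD 'massive_table' USING PigStorage();
C = JOIN A BY $0, B BY $0;
-- Assumption: A has exactly one column, so $0 is A's ID
-- and $1 onward are the columns of B (including B's own ID).
D = FOREACH C GENERATE $1 ..;
STORE D INTO 'foo' USING PigStorage();
```

This keeps the same columns as the original FOREACH (everything except the first), just without naming each one.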
Upvotes: 9