Reputation: 438
Let's take the wordCount example:
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Is it possible to serialize the "bag_words" alias so that we don't have to rebuild the entire bag each time we run the script?
Thanks.
Upvotes: 0
Views: 244
Reputation: 1028
STORE bag_words INTO 'some-output-directory';
Then LOAD it in a later script to skip the FOREACH ... GENERATE, FLATTEN and TOKENIZE step.
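As a rough sketch, a later script could pick up the stored words like this. It assumes the STORE above used the default storage format; the aliases words, grouped and counts and the output path 'word-counts' are only illustrative names, not something from the question.

```pig
-- Read back the words written earlier by:
--   STORE bag_words INTO 'some-output-directory';
words = LOAD 'some-output-directory' AS (word:chararray);

-- Continue the word count from here, without re-tokenizing the raw pages
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

-- 'word-counts' is a hypothetical output path
STORE counts INTO 'word-counts';
```

Keep in mind that STORE fails if the output directory already exists, so remove it or pick a new path before re-running the script that writes bag_words.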
Upvotes: 2
Reputation: 22905
You can write any alias to storage with the STORE command, either in a standard format (such as comma-delimited text via PigStorage) or through your own StoreFunc/LoadFunc classes if you need specific behaviour. You can then LOAD this output in a separate script, skipping the initial LOAD of the raw pages and the tokenizing that follows it.
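For instance, a minimal sketch with the built-in PigStorage function and a comma delimiter. The path 'bag-words-csv' is made up for illustration, and the delimiter and schema have to match between the two scripts.

```pig
-- Script 1: write the flattened words as comma-delimited text
-- ('bag-words-csv' is a hypothetical path)
STORE bag_words INTO 'bag-words-csv' USING PigStorage(',');

-- Script 2: load them back with the same delimiter and declare the schema again
bag_words = LOAD 'bag-words-csv' USING PigStorage(',') AS (word:chararray);
```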
Upvotes: 0