Clarke
Clarke

Reputation: 295

Writing one file per group in Pig Latin

The Problem: I have numerous files that contain Apache web server log entries. Those entries are not in date time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date time, then write them to files named for the day and hour of the entries it contains.

Setup: Once I have imported my files, I am using Regex to get the date field, then I am truncating it to hour. This produces a set that has the record in one field, and the date truncated to hour in another. From here I am grouping on the date-hour field.

First Attempt: My first thought was to use the STORE command while iterating through my groups using a FOREACH and quickly found out that is not cool with Pig.

Second Attempt: My second try was to use the MultiStorage() method in the piggybank which worked great until I looked at the file. The problem is that MulitStorage wants to write all fields to the file, including the field I used to group on. What I really want is just the original record written to the file.

The Question: So...am I using Pig for something it is not intended for, or is there a better way for me to approach this problem using Pig? Now that I have this question out there, I will work on a simple code example to further explain my problem. Once I have it, I will post it here. Thanks in advance.

Upvotes: 2

Views: 1105

Answers (1)

Donald Miner
Donald Miner

Reputation: 39893

Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more times than not I find myself having to write custom UDFs or load/store funcs to get form 95% of the way there to 100% of the way there. I usually find it worth it since just writing a small store function is a lot less Java than a whole MapReduce program.

Your second attempt is really close to what I would do. You should either copy/paste the source code for MultiStorage or use inheritance as a starting point. Then, modify the putNext method to strip out the group value, but still write to that file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rewrite the entire tuple. Or, if all you have is the original string, just pull that out and output that wrapped in a Tuple.

Some general documentation on writing Load/Store functions in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions

Upvotes: 2

Related Questions