Shulhi Sapli

Reputation: 2476

Aggregate row value into columns

I have data like this:

2013-11    localhost       kern
2013-11    localhost       kern
2013-11    192.168.0.59    daemon
2013-12    localhost       kern
2013-12    localhost       daemon
2013-12    localhost       mail

You get the idea. I'm trying to group the above by date (as the row key) and have columns that correspond to the count of each facility (kern, daemon, etc.). In short, my desired output should be as below:

-- date, count(kern), count(daemon), count(mail)
(2013-11, 2, 1, 0)
(2013-12, 1, 1, 1)

Currently, my approach looks like this:

valid_records = FILTER formatted_records BY date is not null;

date_group = GROUP valid_records BY date;
facilities = FOREACH date_group {
    -- DUMB way to filter for now :(
    kern = FILTER valid_records BY facility == 'kern';
    user = FILTER valid_records BY facility == 'user';
    daemon = FILTER valid_records BY facility == 'daemon';

    -- these need to be in order so they get mapped correctly to HBase
    GENERATE group, COUNT(kern), COUNT(user), COUNT(daemon);
}

Two questions:

  1. I have 3 filters above, but in production there will likely be more, maybe 10. Is there a performance hit from using a lot of FILTERs like this?

  2. Is there a better way to do this?

Upvotes: 3

Views: 272

Answers (2)

Ruslan

Reputation: 3273

I think your problem is that you are looking for an output with a variable schema (one column per facility). But it seems that all you have to do is group by a composite key, with this script:

formatted_records = LOAD 'input' AS (date: chararray, host: chararray, facility: chararray);
valid_records = FILTER formatted_records BY date is not null;
counts = FOREACH (GROUP valid_records BY (date, facility)) GENERATE
        group.date AS date,
        group.facility AS facility,
        COUNT(valid_records) AS the_count;
DUMP counts;

You will get:

(2013-11,kern,2)
(2013-11,daemon,1)
(2013-12,kern,1)
(2013-12,mail,1)
(2013-12,daemon,1)

which gives just the same information.

If you want to format the output in a pivoted way like yours, then it is better to do that separately in a general-purpose language (like Java or Python), assuming that Pig's output is small enough to fit in memory. Pig is not good at this.
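For illustration, here is a minimal Python sketch of that post-processing step. It assumes the `counts` relation above was stored as tab-separated lines (the sample `lines` list below stands in for reading such a file) and that the facility column order is fixed up front, as your HBase mapping requires:

```python
from collections import defaultdict

# Pivot Pig's (date, facility, count) output into one row per date.
# Sample input standing in for tab-separated lines stored by Pig.
lines = [
    "2013-11\tkern\t2",
    "2013-11\tdaemon\t1",
    "2013-12\tkern\t1",
    "2013-12\tmail\t1",
    "2013-12\tdaemon\t1",
]

facilities = ["kern", "daemon", "mail"]  # fixed column order for HBase

# Each date maps to a dict of facility -> count, defaulting to 0.
table = defaultdict(lambda: dict.fromkeys(facilities, 0))

for line in lines:
    date, facility, count = line.rstrip("\n").split("\t")
    table[date][facility] = int(count)

for date in sorted(table):
    row = [date] + [str(table[date][f]) for f in facilities]
    print("\t".join(row))
```

Missing facilities come out as 0, so this prints `2013-11  2  1  0` and `2013-12  1  1  1`, matching the desired output.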

Upvotes: 1

davek

Reputation: 22895

As soon as you have a group operation, you will have a reduce stage, which means you are very likely to have the overhead of writing interim mapper outputs, secondary sorts etc. There are other pig operators that trigger a reduce phase, but I don't think FILTER is one of them: the mapper tasks can filter inputs and create multiple outputs that are then sorted and sent to the single reduce phase.

In other words, what looks messy in pig can actually be quite efficient, and what looks concise can spawn a lot more map-reduce overhead than you might expect.

If you run your pig job from the command line you will see a summary of operators in use and a list of M/R jobs that result from the logical plan.

Upvotes: 0
