Corey J. Nolet
Corey J. Nolet

Reputation: 96

Combine and flatten many key/value tuples into a single tuple in pig

I am using Pig 0.8.1. I am somewhat new to Pig but I know there must be a reasonable and re-usable solution for how I want to work with my tuples. I have the following format (similar to triples):

Schema: (uuid, key, val)

Data:
(id1, 'name', 'Corey')
(id1, 'location', 'USA')
(id1, 'carsOwned', 5)
(id2, 'name', 'Paul')
(id2, 'location', 'CANADA')
(id2, 'carsOwned', 10)

The reason I'm representing this data in triples is because it's possible to have multi-valued keys, so pushing the data into a map is out of the question.

What I need to be able to do is find the ids, names and locations of the people with the top 10 cars owned. I'd like it if my output format could be this when sorted in descending order:

Schema: (uuid, name, location, carsOwned)

Data:
(id2, 'Paul', 'CANADA', 10)
(id1, 'Corey', 'USA', 5)

I have tried filtering my input into 3 different aliases (one where key == 'name', one where key == 'location' and one where key == 'carsOwned') so that I can use JOIN and bring them back into one tuple, but it appears that Pig ends up loading from the inputFormat 3 times instead of one. Maybe I'm doing that wrong?

I also tried grouping by id but then I can't seem to find a reasonable way to work with the bag of the actual triple key/values since they all have the exact same schema.

What I really want is to group by the id field and then flatten each of the keys but rename the alias to the actual name of the key.

Any ideas? Thanks in advance!

Upvotes: 1

Views: 3226

Answers (1)

reo katoa
reo katoa

Reputation: 5801

This solution is a bit sloppy, because your data is not organized in a way that Pig is really set up for -- i.e., conceptually each id show be a row key, with the fields named by what you have in the second column. But this can still be done, as long as your data is all reasonable. If you erroneously wind up with multiple rows with the same id and field name, but different values, this will get complicated.

Use a nested foreach to pick out the values for the three fields you're interested in.

keyedByID =
    /* Gather the rows by ID, then process each one in turn */
    FOREACH (GROUP Data BY id) {
        /* Pull out the fields you want. If you have duplicate rows,
           you'll need to use a LIMIT statement to ensure just a single record */
        name = FILTER Data BY field == 'name';
        location = FILTER Data BY field == 'location';
        carsOwned = FILTER Data BY field == 'carsOwned';
    GENERATE
        /* Output each field you want. You'll need to use FLATTEN since
           the things created above in the nested foreach are bags. */
        group AS id,
        FLATTEN(name) AS name,
        FLATTEN(locatioN) AS location,
        FLATTEN(carsOwned) AS carsOwned;
    };

Now you've got a relation that puts all the information for an ID on a single row, and you can do with it whatever you want. For example, you said wanted to pull out the top 10 car owners:

ordered = ORDER keyedByID BY carsOwned DESC;
top10 = LIMIT ordered 10;

Upvotes: 2

Related Questions