HubCap

Reputation: 333

Order of Apache Pig Transformations

I am reading through Programming Pig by Alan Gates.

Consider the code:

ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS 
    (userID:int, movieID:int, rating:int, ratingTime:int);

metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS 
    (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);

nameLookup = FOREACH metadata GENERATE 
    movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;

nameLookupYear = FOREACH nameLookup GENERATE 
    movieID, movieTitle, GetYear(releaseYear) AS finalYear;

filterMovies = FILTER nameLookupYear BY finalYear < 1982;

groupedMovies = GROUP filterMovies BY finalYear;

orderedMovies = FOREACH groupedMovies {
    sortOrder = ORDER metadata by finalYear DESC;
    GENERATE GROUP, finalYear;
    };

DUMP orderedMovies;

It states that

"Sorting by maps, tuples or bags produces error".

I want to know how I can sort the grouped results.

Do the transformations need to follow a certain sequence for them to work?

Upvotes: 0

Views: 65

Answers (2)

savagedata

Reputation: 722

Since you are trying to sort the grouped results, you do not need a nested foreach. You would use the nested foreach if you were trying to, for example, sort each movie within the year by title or release date. Try ordering as usual (refer to finalYear as group since you grouped by finalYear in the previous line):

orderedMovies = ORDER groupedMovies BY group ASC;

DUMP orderedMovies;
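
For contrast, this is roughly what the nested foreach case looks like, where the sort happens inside each year's bag. It is only a sketch and assumes the relations from the question (filterMovies with fields movieID, movieTitle and finalYear); the aliases sortedWithinYear and byTitle are made up for illustration:

```pig
sortedWithinYear = FOREACH groupedMovies {
    -- after GROUP, each row holds the key (group) and a bag named filterMovies
    byTitle = ORDER filterMovies BY movieTitle ASC;
    GENERATE group AS finalYear, byTitle;
};
```

Inside the block, ORDER has to operate on the bag (filterMovies), not on a top-level relation such as metadata, and GENERATE can refer only to the fields of groupedMovies (group and the bag) or to aliases defined in the block.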

Upvotes: 1

nobody

Reputation: 11080

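If you are looking to sort the values inside each group, then you will have to use a nested foreach. This sorts the records in each group by year in descending order. Every record in a group shares the same finalYear, so substitute another field such as movieTitle to get a visible ordering.

orderedMovies = FOREACH groupedMovies {
      -- the bag inside each group is named after the grouped relation, filterMovies
      -- releaseDate was projected away earlier, so order by the derived finalYear
      sortOrder = ORDER filterMovies BY finalYear DESC;
      GENERATE group, sortOrder;
};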

Upvotes: 1
