Reputation: 913
I have two files,
one is titles.csv and has a movie ID and title with this format:
999: Title
734: Another_title
the other is a list of user IDs who link to the movie
categoryID: user1_id, ....
222: 120
227: 414 551
249: 555
Of different sizes (minimum is one user per genre category)
The goal is to first parse the strings so that they are each split into two (for both files), everything before the ':' and everything after.
I have tried doing this
movies = LOAD .... USING PigStorage('\n') AS (line: chararray)
users = LOAD .... USING PigStorage('\n') AS (line: chararray)
-- parse 'users'/outlinks, make a list and count fields
tokenized = FOREACH users GENERATE FLATTEN(TOKENIZE(line, ':')) AS parameter;
filtered = FILTER tokenized BY INDEXOF(parameter, ' ') != -1;
result = FOREACH filtered GENERATE SUBSTRING(parameter, 2, (int)SIZE(parameter)) AS number;
But this is where I got stuck/confused. Thoughts?
I'm also supposed to output the top 10 entries who have the most user IDs in the second part of the string.
Upvotes: 1
Views: 1425
Reputation: 4724
try like this
movies = LOAD 'file1' AS titleLine;
A = FOREACH movies GENERATE FLATTEN(REGEX_EXTRACT_ALL(titleLine,'^(.*):\\s+(.*)$')) AS (movieId:chararray,title:chararray);
users = LOAD 'file2' AS userLine;
B = FOREACH users GENERATE FLATTEN(REGEX_EXTRACT_ALL(userLine,'^(.*):\\s+(.*)$')) AS (categoryId:chararray,userId:chararray);
Output1:
(999,Title)
(734,Another_title)
Output2:
(222,120)
(227,414 551)
(249,555 )
Upvotes: 1