user3295674
user3295674

Reputation: 913

PIG parsing string inputs

I have two files,

one is titles.csv and has a movie ID and title with this format:

999: Title
734: Another_title

the other is a list of user IDs who link to the movie

categoryID: user1_id, ....

222: 120
227: 414 551
249: 555 

Of different sizes (minimum is one user per genre category)

The goal is to first parse the strings so that they are each split into two (for both files), everything before the ':' and everything after.

I have tried doing this

movies = LOAD .... USING PigStorage('\n') AS (line: chararray)
users = LOAD .... USING PigStorage('\n') AS (line: chararray)

-- parse 'users'/outlinks, make a list and count fields
tokenized = FOREACH users GENERATE FLATTEN(TOKENIZE(line, ':')) AS parameter;
filtered = FILTER tokenized BY INDEXOF(parameter, ' ') != -1;
result = FOREACH filtered GENERATE SUBSTRING(parameter, 2, (int)SIZE(parameter)) AS number;

But this is where I got stuck/confused. Thoughts?

I'm also supposed to output the top 10 entries who have the most user IDs in the second part of the string.

Upvotes: 1

Views: 1425

Answers (1)

Sivasakthi Jayaraman
Sivasakthi Jayaraman

Reputation: 4724

try like this

movies = LOAD 'file1' AS titleLine;
A = FOREACH movies GENERATE FLATTEN(REGEX_EXTRACT_ALL(titleLine,'^(.*):\\s+(.*)$')) AS (movieId:chararray,title:chararray);

users = LOAD 'file2' AS userLine;
B = FOREACH users GENERATE FLATTEN(REGEX_EXTRACT_ALL(userLine,'^(.*):\\s+(.*)$')) AS (categoryId:chararray,userId:chararray);

Output1:

(999,Title)
(734,Another_title)

Output2:

(222,120)
(227,414 551)
(249,555 )

Upvotes: 1

Related Questions