Reputation: 79
I am very new to Hadoop, so please bear with me. Any help would be appreciated.
I need to join two tables. Table 1 holds pagename,pagerank pairs. The actual data set is huge but follows the same pattern as this example:
pageA,0.13
pageB,0.14
pageC,0.53
Table 2 is a simple wordcount-style table that maps each word to a colon-separated list of page names. Again, the actual data set is huge but follows the same pattern as this example:
test,pageA:pageB
sample,pageC
json,pageC:pageA:pageD
When a user searches for any word from the second table, I should return the pages for that word, ordered by their pagerank from Table 1 (highest first).
Expected output when searching for test:
test = pageB,pageA
My approach was to load the first table into the distributed cache. In the map method I read the second table, get the list of pages for the word, and sort that list using the pagerank values from the first table held in the distributed cache. This works for the data set I am working with, but I would like to know whether there is a better way, and also how this join can be done with Pig or Hive.
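To make the approach above concrete, here is a minimal sketch of the same logic in plain Python rather than actual Hadoop code. The dictionary stands in for the table loaded from the distributed cache, and rank_pages stands in for the map method. The function names are made up for illustration, and treating a page with no pagerank entry (pageD in the sample) as rank 0.0 is an assumption, since the question does not say how such pages should be handled.

```python
import csv
import io

# Sample rows copied from the question; the real tables would live in HDFS.
PAGERANK_CSV = "pageA,0.13\npageB,0.14\npageC,0.53\n"
WORDS_CSV = "test,pageA:pageB\nsample,pageC\njson,pageC:pageA:pageD\n"


def load_pagerank(text):
    """Build the in-memory lookup that the distributed cache would provide."""
    return {page: float(rank) for page, rank in csv.reader(io.StringIO(text))}


def rank_pages(line, pagerank):
    """Map step: split one 'word,page1:page2' record and sort its pages by rank."""
    word, pages = line.strip().split(",", 1)
    # Assumption: a page with no pagerank entry is treated as rank 0.0 and sorts last.
    ordered = sorted(pages.split(":"), key=lambda p: pagerank.get(p, 0.0), reverse=True)
    return word, ordered


pagerank = load_pagerank(PAGERANK_CSV)
results = dict(rank_pages(line, pagerank) for line in WORDS_CSV.splitlines())
for word, pages in results.items():
    print(f"{word} = {','.join(pages)}")
```

Because only the small lookup table is held in memory and each record of the second table is processed independently, this pattern needs no reduce phase, provided the pagerank table fits in memory on each mapper.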
Upvotes: 3
Views: 692
Reputation: 665
A simple approach using a Pig script:
-- Load both tables; fields are comma-separated.
PAGERANK = LOAD 'hdfs/pagerank/dataset/location' USING PigStorage(',')
    AS (page:chararray, rank:float);
WORDS_TO_PAGES = LOAD 'hdfs/words/dataset/location' USING PigStorage(',')
    AS (word:chararray, pages:chararray);
-- Keep only the row for the queried word and split its page list on ':'.
PAGES_MATCHING = FOREACH (FILTER WORDS_TO_PAGES BY word == '$query_word') GENERATE FLATTEN(TOKENIZE(pages, ':'));
-- Join with the pagerank table, then sort by rank, highest first.
RESULTS = FOREACH (JOIN PAGERANK BY page, PAGES_MATCHING BY $0) GENERATE page, rank;
SORTED_RESULTS = ORDER RESULTS BY rank DESC;
DUMP SORTED_RESULTS;
The script needs one parameter, which is the query word:
pig -f pagerank_join.pig -param query_word=test
Upvotes: 1