bespam

Reputation: 11

How to index a very large graph using Hadoop MapReduce?

I have a very large graph (100 billion links, 1 TB) in the form of a long text file, where each line defines one arc of the graph.

ref file:

page1, page2
page3, page10
page5, page1
.
.
.
pageN, pageM

where pageN can be any webpage.

To save space, I want to convert this graph into an indexed version (two files).

index file (node file):

page1, 1
page2, 2
page3, 3
page4, 4
.
.
.
pageN, N

and the arc file (links):

1, 2
3, 10
5, 1
.
.
.
N, M

Are there any MapReduce (Hadoop, Pig, etc.) algorithms to do this conversion efficiently?

Upvotes: 0

Views: 100

Answers (1)

reo katoa

Reputation: 5801

With Pig this is easy. First you'll need to get a list of all the unique pages in your graph. You should be able to get this with DISTINCT, and possibly UNION if there are pages in one column that don't appear in the other. Next, you can use the RANK operator (available in Pig 0.11 and later) to assign each page a unique ID. Save that as your first file.
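As a rough illustration, a minimal Pig Latin sketch of that first step might look like the following (the file names graph.txt and index_file, and all the aliases, are assumptions for the example, not from the original question):

-- Load the raw arc list; TRIM removes the space after each comma.
raw     = LOAD 'graph.txt' USING PigStorage(',') AS (src:chararray, dst:chararray);
edges   = FOREACH raw GENERATE TRIM(src) AS src, TRIM(dst) AS dst;

-- Collect every page name that appears in either column.
srcs    = FOREACH edges GENERATE src AS page;
dsts    = FOREACH edges GENERATE dst AS page;
unioned = UNION srcs, dsts;
pages   = DISTINCT unioned;

-- RANK prepends a unique sequential ID (field name: rank_pages) to each row.
ranked  = RANK pages;

-- First output file: page name, ID.
index_out = FOREACH ranked GENERATE page, rank_pages;
STORE index_out INTO 'index_file' USING PigStorage(',');

RANK takes care of assigning globally unique IDs for you, so there is no need to roll your own counter across reducers.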

Then, you can use JOIN to bring in those IDs to your list of graph edges. Save that as your second file.
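Continuing the same sketch (again with assumed aliases), the join step could look something like this:

-- Rearrange the index so it can be joined on the page name.
page_ids = FOREACH ranked GENERATE rank_pages AS id, page;

-- Replace the source page name with its ID.
j1   = JOIN edges BY src, page_ids BY page;
half = FOREACH j1 GENERATE page_ids::id AS src_id, edges::dst AS dst;

-- Replace the destination page name with its ID.
j2   = JOIN half BY dst, page_ids BY page;
arcs = FOREACH j2 GENERATE half::src_id AS src_id, page_ids::id AS dst_id;

-- Second output file: the arc list with numeric IDs only.
STORE arcs INTO 'arc_file' USING PigStorage(',');

At this scale the plain reduce-side JOIN shown here is the safe default; a replicated (map-side) join would only be an option if the index relation fits in memory on each task.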

If you have any trouble with any of the steps, feel free to post a specific question about that step and we can help you with it.

Upvotes: 1
