Reputation: 11
I have a very large graph (100 billion links, about 1 TB) in the form of a long text file, where each line defines one arc of the graph.
ref file:
page1, page2
page3, page10
page5, page1
.
.
.
pageN, pageM
where pageN can be any webpage.
To save space I want to convert this graph to an indexed version (with two files).
index file (node file):
page1, 1
page2, 2
page3, 3
page4, 4
.
.
.
pageN, N
and the arc file (links):
1, 2
3, 10
5, 1
.
.
.
N, M
Are there any MapReduce (Hadoop, Pig, etc.) algorithms to do this conversion efficiently?
Upvotes: 0
Views: 100
Reputation: 5801
With Pig this is easy. First you'll need to get a list of all the unique pages in your graph. You should be able to get this with DISTINCT, and possibly UNION if there are pages in one column that don't appear in the other. Next, you can use the RANK function to assign each page a unique ID. Save that as your first file.
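As a minimal sketch of that first step (assuming comma-separated input at a placeholder path 'ref_file', Pig 0.11+ for RANK, and illustrative alias/field names; you may also need TRIM if there is whitespace after the commas):

-- load the edge list (assumed comma-separated: source page, destination page)
edges = LOAD 'ref_file' USING PigStorage(',') AS (src:chararray, dst:chararray);

-- collect every page name that appears in either column
srcs = FOREACH edges GENERATE src AS page;
dsts = FOREACH edges GENERATE dst AS page;
all_pages = UNION srcs, dsts;
pages = DISTINCT all_pages;

-- RANK prepends a unique ID field named rank_pages
index = RANK pages;

-- write the node file as: page, id
node_file = FOREACH index GENERATE page, rank_pages;
STORE node_file INTO 'index_file' USING PigStorage(',');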
Then, you can use JOIN to bring those IDs into your list of graph edges (you'll need one join per column, since both endpoints have to be translated). Save that as your second file.
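A sketch of that second step, continuing with the aliases above ('arc_file' is again a placeholder output path):

-- translate the source column: page name -> ID
j1 = JOIN edges BY src, index BY page;
half = FOREACH j1 GENERATE index::rank_pages AS src_id, edges::dst AS dst;

-- translate the destination column the same way
j2 = JOIN half BY dst, index BY page;
arcs = FOREACH j2 GENERATE half::src_id AS src_id, index::rank_pages AS dst_id;

-- write the arc file as: source ID, destination ID
STORE arcs INTO 'arc_file' USING PigStorage(',');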
If you have any trouble with any of the steps, feel free to post a specific question about that step and we can help you with it.
Upvotes: 1