Reputation: 11
I have a very large graph (100 billion links, about 1 TB) in the form of a long text file, where each line defines one arc of the graph.
ref file:
page1, page2
page3, page10
page5, page1
.
.
.
pageN, pageM
where pageN can be any webpage.
To save space I want to convert this graph to an indexed version (with two files).
index file (node file):
page1, 1
page2, 2
page3, 3
page4, 4
.
.
.
pageN, N
and the arc file (links):
1, 2
3, 10
5, 1
.
.
.
N, M
Are there any MapReduce (Hadoop, Pig, etc.) algorithms to do this conversion efficiently?
Upvotes: 0
Views: 100
Reputation: 5801
With Pig this is easy. First you'll need to get a list of all the unique pages in your graph. You should be able to get this with DISTINCT, and possibly UNION if there are pages in one column that don't appear in the other. Next, you can use the RANK function to assign each page a unique ID. Save that as your first file.
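As a minimal sketch of that first step (assuming comma-separated input at a placeholder path 'ref_file', Pig 0.11+ for RANK, and illustrative alias/field names; you may also need TRIM if there is whitespace after the commas):

-- load the edge list (assumed comma-separated: source page, destination page)
edges = LOAD 'ref_file' USING PigStorage(',') AS (src:chararray, dst:chararray);

-- collect every page name that appears in either column
srcs = FOREACH edges GENERATE src AS page;
dsts = FOREACH edges GENERATE dst AS page;
all_pages = UNION srcs, dsts;
pages = DISTINCT all_pages;

-- RANK prepends a unique ID field named rank_pages
index = RANK pages;

-- write the node file as: page, id
node_file = FOREACH index GENERATE page, rank_pages;
STORE node_file INTO 'index_file' USING PigStorage(',');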
Then, you can use JOIN to bring those IDs into your list of graph edges (you'll need one join per column, since both endpoints have to be translated). Save that as your second file.
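A sketch of that second step, continuing with the aliases above ('arc_file' is again a placeholder output path):

-- translate the source column: page name -> ID
j1 = JOIN edges BY src, index BY page;
half = FOREACH j1 GENERATE index::rank_pages AS src_id, edges::dst AS dst;

-- translate the destination column the same way
j2 = JOIN half BY dst, index BY page;
arcs = FOREACH j2 GENERATE half::src_id AS src_id, index::rank_pages AS dst_id;

-- write the arc file as: source ID, destination ID
STORE arcs INTO 'arc_file' USING PigStorage(',');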
If you have any trouble with any of the steps, feel free to post a specific question about that step and we can help you with it.
Upvotes: 1