Reputation: 901
I am new to Hadoop and I am working on a program where the input to the map function is a file whose records look like this:
ID: value:
3 sd
37 g
5675 gk
68 oi
My file is about 10 gigabytes and I want to renumber these IDs in descending order. I don't want to change the values. My output must be like this:
ID: value:
5675 sd
68 g
37 gk
3 oi
I want to do this work on a cluster of nodes. How can I do that?
I think I need a global variable, and I can't have one in a cluster. What can I do?
Upvotes: 0
Views: 98
Reputation: 7507
Before I say this, I like Arnon's answer for using Hadoop.
But since this is a small file (10G is not that big) and you only need to run it once, I would personally just write a small script.
Assuming a tab-delimited file:
sort -nr myfile.txt > myfile.sorted.txt
paste myfile.sorted.txt myfile.txt | cut -f1,4 > newfile.txt
Note the -n and -r flags: a plain sort would order the IDs lexically ("3" before "37" before "5675"), not numerically in descending order.
That might take a long time, certainly longer than using Hadoop would, but it is simple and it works.
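For illustration, here is the same pipeline run end-to-end on the sample data from the question (the filenames are just examples):

```shell
# Recreate the question's sample input as a tab-delimited file.
printf '3\tsd\n37\tg\n5675\tgk\n68\toi\n' > myfile.txt

# Sort numerically (-n) in reverse/descending order (-r) on the id column.
sort -nr myfile.txt > myfile.sorted.txt

# paste produces 4 tab-separated columns per line:
#   sorted_id, sorted_value, orig_id, orig_value
# Keep the sorted id (field 1) and the original value (field 4).
paste myfile.sorted.txt myfile.txt | cut -f1,4 > newfile.txt

cat newfile.txt
```

This prints the desired output: 5675 paired with sd, 68 with g, 37 with gk, and 3 with oi.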
Upvotes: 0
Reputation: 25919
You can run one map/reduce job to sort the ids; that gives you a file with the ids in descending order.
You can then write a second map/reduce job that joins that file with the unsorted file. Each mapper emits an enumerator as the key (which can be calculated from the split size, to allow multiple mappers), so the mapper that goes over the first file emits "1 sd", "2 g", etc., and the mapper that processes the ids file emits "1 5675", "2 68". The reducer then joins the two files on that key.
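To make the enumerate-and-join idea concrete, here is a small local sketch in shell (not the Hadoop job itself; the real mappers would emit the counter as the key, and the reducer would do the join that paste does here by line position):

```shell
# Recreate the question's sample input as a tab-delimited file (id<TAB>value).
printf '3\tsd\n37\tg\n5675\tgk\n68\toi\n' > data.txt

# Stream 1: the values in their original order, prefixed with an
# enumerator by nl, playing the role of the first mapper's key:
# "1 sd", "2 g", ...
cut -f2 data.txt | nl -w1 > values.enumerated

# Stream 2: the ids sorted in descending numeric order, enumerated the
# same way, playing the role of the second mapper's key: "1 5675", "2 68", ...
cut -f1 data.txt | sort -nr | nl -w1 > ids.enumerated

# The "reducer": join the two streams on the enumerator. Since both
# files have exactly one line per rank, pasting by line position does
# the same job; keep the id (field 2) and the value (field 4).
paste ids.enumerated values.enumerated | cut -f2,4
```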
Here's an (untested) Pig 0.11 script that does something along these lines:
A = load 'data' AS (id:chararray,value:chararray);
ID_RAW= FOREACH A GENERATE id;
DATA_RAW = FOREACH A GENERATE value;
ID_SORT= RANK ID_RAW BY id DESC DENSE;
DATA_SORT = RANK DATA_RAW;
ID_DATA = JOIN ID_SORT by $0, DATA_SORT by $0;
RESULT = FOREACH ID_DATA GENERATE ID_SORT::id, DATA_SORT::value;
STORE RESULT INTO 'output';
Upvotes: 1