csperson

Reputation: 901

Key renumbering in map reduce

I am new to Hadoop and I am working on a program whose map function takes as input a file with keys like this:

ID:      value:
3        sd
37       g
5675     gk
68       oi

My file is about 10 gigabytes, and I want to renumber these IDs in descending order without changing the values. My output must look like this:

ID:      value:
5675     sd
68       g
37       gk
3        oi

I want to do this work on a cluster of nodes. How can I do that?

I think I need a global variable, which I can't have on a cluster. What can I do?

Upvotes: 0

Views: 98

Answers (2)

greedybuddha

Reputation: 7507

Before I say this: I like Arnon's answer for using Hadoop.

But since this is a small file (10 GB is not that big) and you only need to run it once, I would personally just write a small script.

Assuming a tab-delimited file:

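# Sort numerically on the ID column, largest first
sort -k1,1nr myfile.txt > myfile.sorted.txt
# Pair each sorted ID (field 1) with the value from the original file (field 4)
paste myfile.sorted.txt myfile.txt | cut -f1,4 > newFile.txt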

That might take a while, probably longer than a Hadoop job on a cluster, but it is simple and it works.
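To check on a small sample what the script is meant to produce, here is a minimal Python sketch of the same idea: sort the IDs numerically in descending order and pair them, line by line, with the values in their original order. The function name `renumber_descending` is made up for illustration, and the sketch holds everything in memory, so it is only meant for a small sample and not for the full 10 GB file.

```python
def renumber_descending(lines):
    """Pair IDs sorted numerically (largest first) with values in original order."""
    # Split each non-empty line into [id, value] on the tab character.
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    # Sort the IDs as integers so that 5675 comes before 68.
    ids = sorted((int(row[0]) for row in rows), reverse=True)
    # The values keep the order they had in the input.
    values = [row[1] for row in rows]
    return [f"{new_id}\t{value}" for new_id, value in zip(ids, values)]


# Sample rows taken from the question.
sample = ["3\tsd\n", "37\tg\n", "5675\tgk\n", "68\toi\n"]
for line in renumber_descending(sample):
    print(line)
```

On the four sample rows from the question this pairs 5675 with sd, 68 with g, 37 with gk and 3 with oi.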

Upvotes: 0

Arnon Rotem-Gal-Oz

Reputation: 25919

You can run one map/reduce job to sort the IDs, which gives you a file with the IDs in descending order.

You can then write a second map/reduce job that joins that file with the unsorted file. Each mapper emits an enumerator (a running record number, which can be calculated from the split size so that multiple mappers can run in parallel) as the key, so the mappers that go over the first file emit "1 sd", "2 g", etc., and the mappers that process the IDs file emit "1 5675", "2 68", and so on. The reducer then joins the two files on that enumerator.

Here's an (untested) Pig 0.11 script that would do something along these lines:
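As a rough illustration of the enumerator idea, here is a small in-memory Python sketch (not Hadoop code). It assumes the number of records in each split is known up front, for example from fixed-size records or from a preliminary counting pass; the helper names `split_offsets`, `map_split` and `reduce_join` and the way the sample data is cut into splits are all made up for the example. Each simulated mapper only needs its own starting offset, so no shared global counter is required.

```python
from collections import defaultdict
from itertools import accumulate


def split_offsets(splits):
    """Return, for each split, how many records precede it (assumed known up front)."""
    counts = [len(split) for split in splits]
    return [0] + list(accumulate(counts))[:-1]


def map_split(split, offset, tag):
    """Simulated mapper: emit (global record number, (tag, item)) for one split."""
    for i, item in enumerate(split):
        yield offset + i + 1, (tag, item)


def reduce_join(pairs):
    """Simulated reducer: join the id and the value that share a record number."""
    grouped = defaultdict(dict)
    for index, (tag, item) in pairs:
        grouped[index][tag] = item
    return [(grouped[i]["id"], grouped[i]["value"]) for i in sorted(grouped)]


# Hypothetical splits: values in original file order, ids already sorted descending.
value_splits = [["sd", "g"], ["gk", "oi"]]
id_splits = [[5675, 68, 37], [3]]

pairs = []
for splits, tag in ((value_splits, "value"), (id_splits, "id")):
    for split, offset in zip(splits, split_offsets(splits)):
        pairs.extend(map_split(split, offset, tag))

result = reduce_join(pairs)
print(result)
```

The two inputs are split differently on purpose: the join only depends on the global record number, not on which mapper produced it.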

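-- Load id as a number so the ranking sorts numerically, not lexically
A = LOAD 'data' AS (id:long, value:chararray);
ID_RAW = FOREACH A GENERATE id;
DATA_RAW = FOREACH A GENERATE value;
-- Rank the ids from largest to smallest
ID_SORT = RANK ID_RAW BY id DESC DENSE;
-- RANK without BY just prepends a sequential row number (DENSE needs BY)
DATA_SORT = RANK DATA_RAW;
-- Join on the rank column ($0) and keep the new id with the original value
ID_DATA = JOIN ID_SORT BY $0, DATA_SORT BY $0;
RESULT = FOREACH ID_DATA GENERATE ID_SORT::id, DATA_SORT::value;
STORE RESULT INTO 'output';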

Upvotes: 1
