user2664210

Reputation: 145

How to use CoGroup of Cascading

I have a nested directory structure in which each directory holds a varying number of part files. I want to apply CoGroup to these files.

Suppose my directory structure looks like this:

directory1/dir1/part-0000
               /part-0001
               /part-0002
           dir2/part-0000
               /part-0001
               /part-0002
           dir3/part-0000
               /part-0001
               /part-0002
           dir4/part-0000
               /part-0001
               /part-0002

These part files contain tab-separated data like:
field1 field2 field3 field4 field5

I want to merge all the part files on the common values of field1, field3, field4 and field5. That is, the final output file should contain data like:

field1 field2_dir1_files field2_dir2_files field2_dir3_files field2_dir4_files field3 field4 field5

If there is a plain MapReduce solution, that is welcome too; I will try it as well :)
How can this be done with the Cascading CoGroup API? Please help me resolve this; I have been trying to solve it for the last two weeks.

Thanks in advance!

Upvotes: 1

Views: 662

Answers (1)

Nagendra kumar

Reputation: 225

We can solve this with the MixedJoin joiner that Cascading provides:

http://docs.cascading.org/cascading/2.5/javadoc/cascading/pipe/joiner/MixedJoin.html

First, connect each input path to its own pipe, then merge the pipes that belong to the same directory.

Call the merged pipes dir1, dir2 and dir3; each of them carries the fields

field1 field2 field3 field4 field5

Put these pipes into an array, dir[].

Then create an array of join fields with one entry per pipe. Every pipe is joined on field1, field3, field4 and field5.

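// Declared field names must be unique across all joined pipes, hence the 2/3 prefixes.
Fields outputFields = new Fields("field1","field2_dir1_files","field3","field4","field5","2field1","field2_dir2_files","2field3","2field4","2field5","3field1","field2_dir3_files","3field3","3field4","3field5");

// One flag per pipe: false means that pipe is joined as outer (MixedJoin.OUTER).
boolean[] i = {false, false, false};

// dir is the Pipe[] and JoinFields the Fields[] described above.
Pipe LastJoin = new CoGroup(dir, JoinFields, outputFields, new MixedJoin(i));

Pipe required = new Retain(LastJoin, new Fields("field1","field2_dir1_files","field2_dir2_files","field2_dir3_files","field3","field4","field5"));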

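Retain keeps only the fields that are required in the output. A fourth directory (dir4) is handled the same way: add its pipe, its join fields, its declared fields and one more false entry.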
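To make the expected result of the join concrete, here is a minimal plain-Java sketch of the same logic, run in memory without Cascading or Hadoop. It is only an illustration under assumptions: the class name FullOuterJoinSketch and the method join are hypothetical, rows are assumed to be already split on tabs, and each directory is assumed to contribute at most one row per key (a real CoGroup would emit every combination when a key repeats within a pipe).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of Cascading.
public class FullOuterJoinSketch {

    /**
     * rowsPerDir.get(d) holds the rows of directory d, each row being
     * {field1, field2, field3, field4, field5}.
     * Output row: field1, one field2 slot per directory, field3, field4, field5.
     * A directory with no row for a key leaves its slot null (outer-join behaviour).
     */
    public static List<String[]> join(List<List<String[]>> rowsPerDir) {
        int dirCount = rowsPerDir.size();
        // Key = (field1, field3, field4, field5); value = field2 per directory.
        Map<List<String>, String[]> byKey = new LinkedHashMap<>();
        for (int d = 0; d < dirCount; d++) {
            for (String[] row : rowsPerDir.get(d)) {
                List<String> key = Arrays.asList(row[0], row[2], row[3], row[4]);
                String[] slots = byKey.computeIfAbsent(key, k -> new String[dirCount]);
                slots[d] = row[1]; // assumption: at most one row per key per directory
            }
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<List<String>, String[]> e : byKey.entrySet()) {
            List<String> key = e.getKey();
            String[] joined = new String[dirCount + 4];
            joined[0] = key.get(0);                              // field1
            System.arraycopy(e.getValue(), 0, joined, 1, dirCount); // field2 per directory
            joined[dirCount + 1] = key.get(1);                   // field3
            joined[dirCount + 2] = key.get(2);                   // field4
            joined[dirCount + 3] = key.get(3);                   // field5
            out.add(joined);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows, only to exercise the method.
        List<List<String[]>> dirs = new ArrayList<>();
        dirs.add(Arrays.asList(
                new String[] {"k1", "a", "x", "y", "z"},
                new String[] {"k2", "b", "x", "y", "z"}));
        dirs.add(Collections.singletonList(
                new String[] {"k1", "c", "x", "y", "z"}));
        for (String[] row : join(dirs)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

The null slots correspond to what the all-false (all outer) MixedJoin leaves behind when a directory has no matching row; if only keys present in every directory are wanted, the rows containing a null slot can be filtered out afterwards.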

Upvotes: 1
