Reputation: 145
I have a recursive directory structure in which each directory may hold a different number of part files. I want to apply CoGroup to these files.
Suppose my directory structure looks like this:
directory1/dir1/part-0000
               /part-0001
               /part-0002
           dir2/part-0000
               /part-0001
               /part-0002
           dir3/part-0000
               /part-0001
               /part-0002
           dir4/part-0000
               /part-0001
               /part-0002
These part files contain tab separated data like:
field1 field2 field3 field4 field5
I want to merge all the part files on the common values of field1, field3, field4 and field5. That is, the final output file will contain data like:
field1 field2_dir1_files field2_dir2_files field2_dir3_files field2_dir4_files field3 field4 field5
If there is a MapReduce solution, that is welcome too; I will try that as well :)
How can this be done with the Cascading CoGroup API?
Please help me resolve this; I have been trying to solve it for the last two weeks.
Thanks in advance!
Upvotes: 1
Views: 662
Reputation: 225
This can be solved with the MixedJoin joiner that Cascading provides:
http://docs.cascading.org/cascading/2.5/javadoc/cascading/pipe/joiner/MixedJoin.html
First, connect each input path to its own pipe and merge the pipes that belong to the same directory.
Call the merged pipes dir1, dir2 and dir3; each of them has the fields
field1 field2 field3 field4 field5
Put these pipes into an array, dir[].
Then create an array of join fields with one entry per pipe. Each pipe is joined on field1, field3, field4 and field5.
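// Declared field names must be unique across all joined pipes, so the duplicated key fields get a numeric prefix.
Fields outputFields = new Fields("field1", "field2_dir1_files", "field3", "field4", "field5", "2field1", "field2_dir2_files", "2field3", "2field4", "2field5", "3field1", "field2_dir3_files", "3field3", "3field4", "3field5");
// One flag per pipe: true = inner join, false = outer join.
boolean[] asInner = {false, false, false};
// dir is the Pipe[] and joinFields the Fields[] described above.
Pipe lastJoin = new CoGroup(dir, joinFields, outputFields, new MixedJoin(asInner));
Pipe required = new Retain(lastJoin, new Fields("field1", "field2_dir1_files", "field2_dir2_files", "field2_dir3_files", "field3", "field4", "field5"));
Retain keeps only the fields that are required in the output.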
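To make it clearer what this join is expected to produce, here is a rough plain-Java sketch of a full outer join on field1, field3, field4 and field5. It does not use Cascading at all. The class name OuterJoinSketch, the join method and the sample rows are made up for illustration, and it assumes at most one row per key in each directory and writes an empty column where a directory has no matching row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Cascading API.
class OuterJoinSketch {

    // rowsPerDir.get(i) holds the tab-separated lines read from directory i.
    // Returns one line per distinct (field1, field3, field4, field5) key:
    // field1, then one field2 column per directory, then field3..field5.
    public static List<String> join(List<List<String>> rowsPerDir) {
        int n = rowsPerDir.size();
        // key -> one field2 slot per directory; TreeMap keeps the output order stable
        Map<String, String[]> grouped = new TreeMap<>();
        for (int d = 0; d < n; d++) {
            for (String line : rowsPerDir.get(d)) {
                String[] f = line.split("\t", -1);
                String key = String.join("\t", f[0], f[2], f[3], f[4]);
                String[] slots = grouped.computeIfAbsent(key, k -> new String[n]);
                slots[d] = f[1]; // assumption: at most one row per key per directory
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String[]> e : grouped.entrySet()) {
            String[] k = e.getKey().split("\t", -1);
            List<String> cols = new ArrayList<>();
            cols.add(k[0]);
            for (String v : e.getValue()) {
                cols.add(v == null ? "" : v); // empty when a directory lacks the key
            }
            cols.addAll(Arrays.asList(k[1], k[2], k[3]));
            out.add(String.join("\t", cols));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for the part files of three directories.
        List<List<String>> input = Arrays.asList(
            Arrays.asList("a\tx1\tc\td\te"),
            Arrays.asList("a\tx2\tc\td\te", "b\ty2\tc\td\te"),
            Arrays.asList("b\ty3\tc\td\te"));
        for (String row : join(input)) {
            System.out.println(row);
        }
    }
}
```

If a key can occur several times within one directory, a real outer join would emit one output row per combination, which this sketch does not attempt.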
Upvotes: 1