Reputation: 680
I have a large file that is formatted as follows:
sample name \t index \t score
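For example, a few lines might look like this (the sample names, indices, and scores below are made up, just to illustrate the tab-separated layout):
sample_A    1    0.87
sample_A    2    0.34
sample_B    1    0.91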
I'm trying to split this file based on sample name using Hadoop Streaming. I know ahead of time how many samples there are, so I can specify how many reducers I need. This post does something very similar, so I know it is possible.
I tried the following command to split the file into 16 output files (there are 16 samples):
hadoop jar $STREAMING \
-D mapred.text.key.partitioner.options=-k1,1 \
-D stream.num.map.output.key.fields=2 \
-D mapred.reduce.tasks=16 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper cat \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-input input_dir/*part* \
-output output_dir
This somewhat works: some of the part* files contain exactly one sample name. However, most of the part* files are empty, and some contain multiple sample names.
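A quick check like the one below (assuming the output is still tab-separated with the sample name in the first field) lists the distinct sample names that ended up in a single part file:
# list the distinct sample names in one reducer's output
hadoop fs -cat output_dir/part-00000 | cut -f1 | sort -u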
Is there a better way to make sure that every reducer gets only one sample name?
Upvotes: 2
Views: 1209