How to split a big Sequence file into multiple sequence files?

Question

I have a large sequence file with around 60 million entries (almost 4.5GB). I want to split it. For example, I want to split it into three parts, each having 20 million entries. So far my code is like this:

// Read from sequence file
JavaPairRDD<IntWritable, VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
JavaPairRDD<IntWritable, VectorWritable> part = seqVectors.coalesce(3);
part.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);

But unfortunately, each of the generated sequence files is around 4GB too (total 12GB)! Can anyone suggest a better/valid approach?

vefthym · Accepted Answer

Perhaps not the exact answer you are looking for, but it might be worth trying the second overload of sequenceFile, the one that takes a minPartitions argument. Keep in mind that coalesce, which you are using, can only decrease the number of partitions unless you ask it to shuffle; by default it never increases it.

Your code should then look like this:

// Read from sequence file, asking for at least 3 partitions
JavaPairRDD<IntWritable, VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class, 3);
seqVectors.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
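
Note that minPartitions is a lower bound, not an exact count, so you may end up with more than three output files. The sketch below is a simplified model of how the old-API Hadoop FileInputFormat (which sequenceFile relies on) sizes its splits; it is not Hadoop code, and the class name, method names, the 128MB block size and the rounded 4608MB file size are assumptions made for illustration.

```java
// Simplified model of split sizing in the old-API Hadoop FileInputFormat.
// Illustrative only: SplitSizeSketch and its methods are hypothetical names.
class SplitSizeSketch {

    // splitSize = max(minSize, min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Approximate number of splits for one splittable file.
    // (Real Hadoop may merge a small final remainder into the last split.)
    static long estimateSplits(long fileSize, int minPartitions, long minSize, long blockSize) {
        long goalSize = fileSize / Math.max(1, minPartitions);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 4608 * mb;  // roughly the 4.5GB file from the question
        long blockSize = 128 * mb;  // assumed HDFS block size

        // minPartitions = 3 with a tiny minimum split size: the block size wins.
        System.out.println(estimateSplits(fileSize, 3, 1, blockSize));          // 36

        // Raising the minimum split size to fileSize / 3 yields three splits.
        System.out.println(estimateSplits(fileSize, 3, 1536 * mb, blockSize));  // 3
    }
}
```

Under these assumptions, asking for 3 partitions still gives about 36, because each split is capped at the block size. To get exactly three files you would either have to raise Hadoop's minimum split size or reduce the number of partitions after reading.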

Another thing that may cause problems is that some SequenceFiles are not splittable.
