Reputation: 721
I have a large sequence file with around 60 million entries (almost 4.5GB). I want to split it. For example, I want to split it into three parts, each having 20 million entries. So far my code is like this:
//Read from sequence file
JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
JavaPairRDD<IntWritable,VectorWritable> part=seqVectors.coalesce(3);
part.saveAsHadoopFile(outputPath+File.separator+"output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
But unfortunately, each of the generated sequence files is around 4GB too (total 12GB)! Can anyone suggest a better/valid approach?
Upvotes: 1
Views: 755
Reputation: 7462
Perhaps not the exact answer you are looking for, but it might be worth trying the other sequenceFile overload for reading, the one that takes a minPartitions argument. Keep in mind that coalesce, which you are using, can only decrease the number of partitions.
Your code should then look like this:
//Read from sequence file
JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class, 3);
seqVectors.saveAsHadoopFile(outputPath+File.separator+"output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
Another thing that may cause problems is that some SequenceFiles are not splittable.
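One quick way to verify this (not from the answer above, just a small hedged check) is to print how many partitions the read actually produced:

//Check how many partitions sc.sequenceFile actually created
System.out.println("partitions: " + seqVectors.partitions().size());

If it prints 1 even with minPartitions set to 3, the input was read as a single split, and since coalesce cannot increase the partition count, saveAsHadoopFile will only ever write one part file.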
Upvotes: 1
Reputation: 106
Maybe I'm not understanding your question correctly, but why not just read your file line by line (= entry by entry?) and build your three files that way? It would look something like this:
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

//The enclosing method must declare or handle IOException
int i = 0;
List<PrintWriter> files = new ArrayList<PrintWriter>();
files.add(new PrintWriter("the-file-name1.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name2.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name3.txt", "UTF-8"));
for (String line : Files.readAllLines(Paths.get(fileName))) { //reads the whole file into memory
    files.get(i % 3).println(line); //round-robin across the three files
    i++;
}
for (PrintWriter w : files) w.close(); //flush and close the writers
In this case, lines are distributed round-robin, so every third line goes into the first, the second, and the third file respectively.
Another solution, if the file is not a text file, would be to do a binary read using Files.readAllBytes(Paths.get(inputFileName)) and to write to your output files with Files.write(Paths.get(output1), byteToWrite).
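As a rough sketch of that binary variant (not from the answer above; it assumes the whole file fits in memory and uses hypothetical paths input.seq and part1/part2/part3), the byte array could simply be cut into three roughly equal slices:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

//Hedged sketch: read everything into memory, then write three roughly equal slices
//(the enclosing method must declare or handle IOException)
byte[] all = Files.readAllBytes(Paths.get("input.seq"));
int third = all.length / 3;
Files.write(Paths.get("part1"), Arrays.copyOfRange(all, 0, third));
Files.write(Paths.get("part2"), Arrays.copyOfRange(all, third, 2 * third));
Files.write(Paths.get("part3"), Arrays.copyOfRange(all, 2 * third, all.length));

Note, though, that cutting a SequenceFile at arbitrary byte offsets would not leave you with valid SequenceFiles, so this only really applies to plain binary data.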
However, I do not have an answer as to why the output takes up so much more space the way you are doing it. Maybe the encoding is to blame? I think Java encodes in UTF-8 by default, and your input file might be encoded in ASCII.
Upvotes: 0