Reputation: 8967
I have a JavaRDD<Model> that I need to write out as more than one file, each with a different layout (one or two fields differ between layouts).
When I use saveAsTextFile(), it calls the toString() method of Model, which means every output is written with the same layout.
Currently I iterate over the RDD with a map transformation that returns a different model with the other layout, so that I can use the saveAsTextFile() action to write a different output file.
So just because one or two fields differ, I have to iterate over the entire RDD again, create a new RDD, and then save it as an output file.
For example:
Current RDD with fields:
RoleIndicator, Name, Age, Address, Department
Output File 1:
Name, Age, Address
Output File 2:
RoleIndicator, Name, Age, Department
Is there a more efficient solution for this?
Regards, Shankar
Upvotes: 3
Views: 4963
Reputation: 39606
You want to use foreach, not collect.
You should define your function as an actual named class that implements VoidFunction. Create instance variables for both files, and add a close() method that closes the files. Your call() implementation will write whatever you need.
Remember to call close() on your function object after you're done.
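A minimal sketch of that shape, kept free of Spark dependencies so it runs on its own. The Model fields follow the question's example; the class name TwoLayoutWriter and the use of java.io.Writer are assumptions for illustration. In Spark the class would additionally implement org.apache.spark.api.java.function.VoidFunction<Model> (and be Serializable), and this sketch assumes everything runs in a single JVM, as in local mode.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical stand-in for the question's Model class.
class Model {
    final String roleIndicator, name, address, department;
    final int age;

    Model(String roleIndicator, String name, int age, String address, String department) {
        this.roleIndicator = roleIndicator;
        this.name = name;
        this.age = age;
        this.address = address;
        this.department = department;
    }
}

// Writes each record to two outputs with different layouts in one pass.
// call()/close() mirror the methods described in the answer.
class TwoLayoutWriter implements AutoCloseable {
    private final Writer file1; // layout: Name, Age, Address
    private final Writer file2; // layout: RoleIndicator, Name, Age, Department

    TwoLayoutWriter(Writer file1, Writer file2) {
        this.file1 = file1;
        this.file2 = file2;
    }

    public void call(Model m) {
        try {
            file1.write(m.name + ", " + m.age + ", " + m.address + "\n");
            file2.write(m.roleIndicator + ", " + m.name + ", " + m.age + ", " + m.department + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            file1.close();
            file2.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With Spark, an instance would be passed to rdd.foreach(...) and closed afterwards; here the writers could be FileWriter instances for the two output files.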
Upvotes: 3
Reputation: 4372
This is possible with a pair RDD. A pair RDD can be written to multiple files in a single pass by using a custom Hadoop output format:
rdd.saveAsHadoopFile(path, Text.class, Text.class, FileGroupingTextOutputFormat.class, jobConf);
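import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class FileGroupingTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    // Return null so that only the value is written (no key and no tab separator).
    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }

    @Override
    protected Text generateActualValue(Text key, Text value) {
        return value;
    }

    // Returns a dynamic file name based on each RDD element.
    // Assumes the pair's key carries the field that selects the output file.
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "-" + name;
    }
}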
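To get from the question's Model to a pair RDD, one option is to emit one (layout name, formatted line) pair per output layout for every record, so the output format can derive the file name from the layout name. The sketch below shows only that per-record step in plain Java; the helper name LayoutPairs.toPairs and the keys "layout1" and "layout2" are made up for illustration. In Spark the same logic would sit inside a flatMapToPair call, with keys and values converted to Text.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LayoutPairs {
    // Hypothetical helper: turns one record into two (layoutName, line) pairs,
    // one per output layout from the question's example.
    static List<Map.Entry<String, String>> toPairs(
            String roleIndicator, String name, int age, String address, String department) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        // Output File 1: Name, Age, Address
        pairs.add(new SimpleEntry<>("layout1", name + ", " + age + ", " + address));
        // Output File 2: RoleIndicator, Name, Age, Department
        pairs.add(new SimpleEntry<>("layout2",
                roleIndicator + ", " + name + ", " + age + ", " + department));
        return pairs;
    }
}
```

Each record is visited once, and both layouts come out of the same pass.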
Upvotes: 0