Reputation: 493
I am using a data source in the following format, with one number per line up to 1500000:
1
2
3
4
5
..
1500000
I use the following code snippet:
JavaRDD<String> dataCollection = ctx.textFile("hdfs://yarncluster/Input/datasource");
JavaPairRDD<String, String> rdd = dataCollection.cartesian(dataCollection);
rdd.saveAsTextFile("hdfs://yarncluster/Ouput");
It takes a long time to save the data to the cluster. Is there any way to improve the performance?
Upvotes: 0
Views: 320
Reputation: 7442
You could increase the level of parallelism by calling repartition with a larger number of partitions, for example on dataCollection before calling cartesian.
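To get a feel for how much work the job does and how repartitioning changes the task count, here is a rough plain-Java sketch (no Spark needed to run it). It relies on two things: a self-cartesian of n records emits n * n pairs, and Spark's cartesian yields one output partition per pair of input partitions. The class name CartesianEstimate and the partition count of 200 are only illustrative assumptions, not values from the question, and the byte figure is a lower bound that assumes each pair is written as a line like (1,1).

```java
// Back-of-the-envelope sizing for rdd.cartesian(rdd); plain Java, no Spark required.
class CartesianEstimate {

    // A self-cartesian of n records emits n * n pairs.
    static long pairCount(long records) {
        return Math.multiplyExact(records, records);
    }

    // cartesian produces one output partition per pair of input partitions.
    static long outputPartitions(long inputPartitions) {
        return Math.multiplyExact(inputPartitions, inputPartitions);
    }

    // Lower bound on text output: the shortest line, "(1,1)" plus a newline, is 6 bytes.
    static long minOutputBytes(long records) {
        return Math.multiplyExact(pairCount(records), 6L);
    }

    public static void main(String[] args) {
        long records = 1_500_000L;    // record count from the question
        long inputPartitions = 200L;  // hypothetical value passed to repartition
        System.out.println("pairs: " + pairCount(records));
        System.out.println("output partitions: " + outputPartitions(inputPartitions));
        System.out.println("min output bytes: " + minOutputBytes(records));
    }
}
```

In the job itself this would correspond to something like dataCollection.repartition(200) before the cartesian call. Since the output grows with the square of the input, more partitions spread the write across more tasks but do not reduce the total amount of data written.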
Upvotes: 1