Raja

Reputation: 493

saveAsTextFile performance improvement

I have a data source that contains the numbers 1 to 1500000, one per line:

1
2
3
4
5
..
1500000

I am using the following code snippet:

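// Read the input file: one number per line.
JavaRDD<String> dataCollection = ctx.textFile("hdfs://yarncluster/Input/datasource");

// Pair every line with every line of the same RDD (N x N records).
JavaPairRDD<String, String> rdd = dataCollection.cartesian(dataCollection);

// Write the resulting pairs back to HDFS as text.
rdd.saveAsTextFile("hdfs://yarncluster/Ouput");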

Saving the data to the cluster takes a long time. Is there any way to improve the performance?

Upvotes: 0

Views: 320

Answers (1)

Holden

Reputation: 7442

You could increase the level of parallelism by calling repartition with a large number of partitions, so that the write is spread over more tasks.
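
To get a feel for what "a large number of partitions" might mean here, a rough sizing sketch can help. It assumes the repartition is applied to the input, e.g. dataCollection.repartition(n), before calling cartesian, because the result of cartesian has as many partitions as the product of the two input partition counts. The class name, the helper methods and the figure of 200 input partitions are all made up for illustration; they are not from the question.

```java
// Hypothetical sizing helper: plain arithmetic, no Spark dependency.
public class CartesianSizing {

    // Number of pairs produced by the cartesian product of an RDD with itself.
    static long cartesianPairs(long inputRecords) {
        return Math.multiplyExact(inputRecords, inputRecords);
    }

    // cartesian() yields (left partitions) x (right partitions) output partitions.
    static long cartesianPartitions(int leftPartitions, int rightPartitions) {
        return (long) leftPartitions * rightPartitions;
    }

    // Pairs each output partition (and so each output file) holds, rounded up.
    static long pairsPerPartition(long pairs, long partitions) {
        return (pairs + partitions - 1) / partitions;
    }

    public static void main(String[] args) {
        long pairs = cartesianPairs(1_500_000L);

        // Assumed value: input repartitioned to 200 partitions before cartesian.
        long partitions = cartesianPartitions(200, 200);

        System.out.println("pairs to write:     " + pairs);
        System.out.println("output partitions:  " + partitions);
        System.out.println("pairs per partition: " + pairsPerPartition(pairs, partitions));
    }
}
```

Whatever the partition count, the job still has to write 1500000 x 1500000 pairs, so more partitions only spread the write over more tasks; they do not reduce the amount of data saved.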

Upvotes: 1
