Reputation: 28492
By default, newer versions of Spark use compression when saving text files. For example:
val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("/path/to/output")
will create the output files in .deflate
format. It's quite easy to change the compression algorithm, e.g. to gzip
:
import org.apache.hadoop.io.compress._
val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("/path/to/output", classOf[GzipCodec])
But is there a way to save an RDD as plain text files, i.e. without any compression?
Upvotes: 12
Views: 9486
Reputation: 35404
With this code I can see the text files in HDFS without any compression:
val conf = new SparkConf().setMaster("local").setAppName("App name")
val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("mapred.output.compress", "false")
val txt = sc.parallelize(List("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/path/to/save/file")
You can set all Hadoop-related properties on sc.hadoopConfiguration
.
Verified this code on Spark 1.5.2 (Scala 2.11).
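Note that on Hadoop 2.x and later, mapred.output.compress is a deprecated alias; the renamed key is mapreduce.output.fileoutputformat.compress. A minimal sketch using the newer name (assuming a Hadoop 2+ build of Spark; I have not verified this on every version, and the old key still works as an alias):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("App name")
val sc = new SparkContext(conf)

// Hadoop 2+ property name; the older "mapred.output.compress"
// is kept as a deprecated alias, so either key should work.
sc.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress", "false")

val txt = sc.parallelize(List("Hello", "world", "!"))
// Writes plain part-* files with no .deflate/.gz suffix.
txt.saveAsTextFile("hdfs/path/to/save/file")
```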
Upvotes: 13