Reputation: 7540
Is there a simple way to save a DataFrame into a single Parquet file, or to merge the directory containing the metadata and the parts of the Parquet file produced by sqlContext.saveAsParquetFile() into a single file stored on NFS, without using HDFS and Hadoop?
Upvotes: 2
Views: 4473
Reputation: 81
coalesce(N) has saved me so far. If your table is partitioned, use repartition("partition key") as well.
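A minimal sketch of how that might look before writing with the DataFrame API (the input/output paths and the "date" partition column are illustrative assumptions, not from this answer):
import org.apache.spark.sql.functions.col
// Hypothetical input path, for illustration only.
val df = sqlContext.read.parquet("/data/events")
// Unpartitioned table: collapse everything into a single Parquet part file.
df.coalesce(1).write.parquet("/tmp/events_single/")
// Partitioned table: repartition by the partition key so each partition
// directory ends up with one file.
df.repartition(col("date")).write.partitionBy("date").parquet("/tmp/events_by_date/")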
Upvotes: 0
Reputation: 377
I was able to use this method to compress parquet files using snappy format with Spark 1.6.1. I used overwrite so that I could repeat the process if needed. Here is the code.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode

object CompressApp {
  // HDFS locations of the source data and the compacted output.
  val serverPort = "hdfs://myserver:8020/"
  val inputUri = serverPort + "input"
  val outputUri = serverPort + "output"

  val config = new SparkConf()
    .setAppName("compress-app")
    .setMaster("local[*]")
  val sc = SparkContext.getOrCreate(config)
  val sqlContext = SQLContext.getOrCreate(sc)

  // Write Parquet with snappy compression.
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
  import sqlContext.implicits._

  def main(args: Array[String]) {
    println("Compressing Parquet...")
    // coalesce(1) collapses the DataFrame into a single partition,
    // so the output directory contains a single Parquet part file.
    val df = sqlContext.read.parquet(inputUri).coalesce(1)
    df.write.mode(SaveMode.Overwrite).parquet(outputUri)
    println("Done.")
  }
}
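The same writer also accepts local paths (for example file:///mnt/nfs/output, which is just an illustrative mount point), so if the NFS mount is visible to the driver and executors you can point the output there instead of at HDFS.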
Upvotes: 0
Reputation: 2234
To save only one file, rather than many, you can call coalesce(1) / repartition(1) on the RDD/DataFrame before the data is saved.
If you already have a directory with small files, you could create a Compacter process which would read in the existing files and save them to one new file. E.g.
val rows = parquetFile(...).coalesce(1)
rows.saveAsParquetFile(...)
You can store to a local file system using saveAsParquetFile. e.g.
rows.saveAsParquetFile("/tmp/onefile/")
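On newer Spark versions where parquetFile / saveAsParquetFile are deprecated, a roughly equivalent sketch with the DataFrame reader/writer would be (the input path is an assumption for illustration):
// Read the existing small files, collapse to one partition, and write a single file.
val rows = sqlContext.read.parquet("/tmp/manyfiles/").coalesce(1)
rows.write.parquet("/tmp/onefile/")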
Upvotes: 5