JiriS

Reputation: 7540

Merge parquet file on standalone spark

Is there a simple way to save a DataFrame into a single parquet file, or to merge the directory of metadata and part files produced by sqlContext.saveAsParquetFile() into a single file stored on NFS, without using HDFS and Hadoop?

Upvotes: 2

Views: 4473

Answers (3)

Jeff A.

Reputation: 81

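coalesce(N) has worked for me so far: it reduces the DataFrame to N partitions, so N part files are written.
If your table is partitioned, then also repartition by the partition key, i.e. repartition("partition key").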

Upvotes: 0

ekrich

Reputation: 377

I was able to use this method to compress parquet files with Snappy on Spark 1.6.1. I used overwrite mode so that I could repeat the process if needed. Here is the code:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode

object CompressApp {
  // Base URI of the cluster; input and output directories live under it.
  val serverPort = "hdfs://myserver:8020/"
  val inputUri = serverPort + "input"
  val outputUri = serverPort + "output"

  val config = new SparkConf()
    .setAppName("compress-app")
    .setMaster("local[*]")
  val sc = SparkContext.getOrCreate(config)
  val sqlContext = SQLContext.getOrCreate(sc)
  // Write parquet output with Snappy compression.
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

  def main(args: Array[String]): Unit = {
    println("Compressing Parquet...")
    // coalesce(1) collapses the data into one partition, so one part file is written.
    val df = sqlContext.read.parquet(inputUri).coalesce(1)
    // Overwrite replaces any previous output, so the job can be rerun.
    df.write.mode(SaveMode.Overwrite).parquet(outputUri)
    println("Done.")
  }
}

Upvotes: 0

Patrick McGloin

Reputation: 2234

To save one file rather than many, you can call coalesce(1) or repartition(1) on the RDD/DataFrame before the data is saved.

If you already have a directory with small files, you could create a compaction process that reads in the existing files and saves them to one new file. E.g.

val rows = sqlContext.parquetFile(...).coalesce(1)
rows.saveAsParquetFile(...)

You can also write to a local file system with saveAsParquetFile, e.g.

rows.saveAsParquetFile("/tmp/onefile/")
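
Note that even with coalesce(1), Spark still writes a directory (here /tmp/onefile/) holding one part-*.parquet file next to marker files such as _SUCCESS. If a single plain file is needed on a locally mounted path such as NFS, the part file can be moved out with ordinary JVM file operations. The following is only a rough sketch under that assumption; the names SinglePartFile and promote are made up for illustration and are not part of Spark.

```scala
import java.io.File
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper, not a Spark API: assumes outputDir is a directory
// written by Spark after coalesce(1), on a locally mounted file system.
object SinglePartFile {
  /** Moves the only part-*.parquet file in `outputDir` to `target` and returns `target`. */
  def promote(outputDir: Path, target: Path): Path = {
    // listFiles() returns null when outputDir is not a directory.
    val parts = Option(outputDir.toFile.listFiles())
      .getOrElse(Array.empty[File])
      .filter { f =>
        f.isFile && f.getName.startsWith("part-") && f.getName.endsWith(".parquet")
      }

    parts match {
      case Array(single) =>
        // Replace any earlier copy so the step can be repeated.
        Files.move(single.toPath, target, StandardCopyOption.REPLACE_EXISTING)
      case other =>
        throw new IllegalStateException(
          s"Expected exactly one part file in $outputDir, found ${other.length}")
    }
  }
}
```

For example, SinglePartFile.promote(Paths.get("/tmp/onefile"), Paths.get("/tmp/onefile.parquet")) would leave a single parquet file at the second path (with java.nio.file.Paths imported). The helper deliberately fails when it finds zero or several part files, since that means the data was not coalesced to one partition.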

Upvotes: 5
