Reputation: 1465
I am successful in reading a text file stored in S3 and writing it back to S3 in ORC format using Spark DataFrames: inputDf.write().orc(outputPath);
What I am not able to do is write ORC with Snappy compression. I already tried setting the compression codec to snappy as an option on the writer, but Spark still writes plain ORC. How can I write ORC with Snappy compression to S3 using Spark DataFrames?
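The attempt looked roughly like the following (a sketch only; the exact option key that was tried may have differed):

inputDf.write().option("orc.compress", "SNAPPY").orc(outputPath); // option key is a guess; output was still plain ORC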
Upvotes: 4
Views: 4217
Reputation: 1465
For anyone facing the same issue: in Spark 2.0 this works out of the box, since the default compression format for ORC is snappy.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConvertToOrc {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("OrcConvert")
                .getOrCreate();

        String inputPath = args[0];
        String outputPath = args[1];

        // Read the input as Ctrl-A (\001) separated text with single-quoted fields
        Dataset<Row> inputDf = spark.read().option("sep", "\001").option("quote", "'").csv(inputPath);

        // Write as ORC; Spark 2.0 compresses ORC with snappy by default
        inputDf.write().format("orc").save(outputPath);
    }
}
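If you prefer not to rely on the default, the codec can also be set explicitly. A minimal sketch, assuming Spark 2.x where the ORC data source accepts a "compression" option (and, from Spark 2.3 onward, a session-level config):

// Set the ORC codec explicitly on the writer (assumption: Spark 2.x ORC data source)
inputDf.write()
        .format("orc")
        .option("compression", "snappy")  // other accepted values include none, zlib, lzo
        .save(outputPath);

// From Spark 2.3 onward the codec can also be set session-wide (assumption: 2.3+):
// spark.conf().set("spark.sql.orc.compression.codec", "snappy");

A quick way to confirm the codec took effect is to look at the output file names, which typically carry a .snappy.orc suffix when snappy compression is used.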
Upvotes: 3