br0ken.pipe

Reputation: 910

Writing a Parquet file from a CSV file using Apache Spark in Java

I would like to convert CSV to Parquet using spark-csv.

Reading the file into a Dataset works. Unfortunately, writing it back out as a Parquet file fails. Is there any way to achieve this?

// Local SparkSession; the warehouse dir setting is a Windows workaround
SparkSession spark = SparkSession.builder()
        .appName("Java Spark SQL basic example")
        .config("spark.master", "local")
        .config("spark.sql.warehouse.dir", "file:///C:\\spark_warehouse")
        .getOrCreate();

// Read the CSV via the spark-csv data source, inferring the schema
// from the data and treating the first line as a header
Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("sample.csv");

// Write the same data back out as Parquet (creates a directory of part files)
df.write().parquet("test.parquet");
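
As an aside, on Spark 2.x the CSV source is built in, so the read does not strictly need the com.databricks.spark.csv package. A minimal sketch of the equivalent call, assuming the same sample.csv:

// Built-in CSV source (Spark 2.0+), equivalent to the spark-csv read above
Dataset<Row> df2 = spark.read()
        .option("inferSchema", "true") // sample the data to infer column types
        .option("header", "true")      // treat the first line as column names
        .csv("sample.csv");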

Exception:

17/04/11 09:57:32 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NoSuchMethodError: org.apache.parquet.column.ParquetProperties.builder()Lorg/apache/parquet/column/ParquetProperties$Builder;
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:362)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:350)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:145)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Upvotes: 0

Views: 4519

Answers (1)

br0ken.pipe

Reputation: 910

I fixed it with a workaround. I had to comment out these two Parquet dependencies in my pom.xml, though I'm not entirely sure why they get in each other's way:

<!--
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.9.0</version>
</dependency>

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-common</artifactId>
    <version>1.9.0</version>
</dependency>
-->
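
The likely explanation: Spark 2.1 bundles its own Parquet artifacts (the 1.8.x line), so declaring parquet-hadoop 1.9.0 explicitly puts two Parquet releases on the classpath. The 1.9.0 ParquetOutputFormat then calls ParquetProperties.builder(), a method that only exists from Parquet 1.9.0 on, against an older parquet-column class that was loaded instead, which is exactly the NoSuchMethodError in the stack trace. One hedged way to confirm such a mix-up is to print where each class was actually loaded from:

// Diagnostic sketch: print the jar each Parquet class resolves to.
// Two different version numbers here would confirm a mixed classpath.
System.out.println(org.apache.parquet.hadoop.ParquetOutputFormat.class
        .getProtectionDomain().getCodeSource().getLocation());
System.out.println(org.apache.parquet.column.ParquetProperties.class
        .getProtectionDomain().getCodeSource().getLocation());

With the explicit 1.9.0 dependencies removed, both classes presumably resolve to the single Parquet version that spark-sql pulls in transitively, and the write succeeds.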

Upvotes: 1
