jake wong

Reputation: 5228

How to write Parquet files in Java with Apache Arrow

I am trying to write data from Java to Apache Parquet. So far, I have used Apache Arrow, following the examples here: https://arrow.apache.org/cookbook/java/schema.html#creating-fields, to create an Arrow-format dataset.

The question is: how do I write it to Parquet after that? Also, do I need Apache Arrow to output the data as a Parquet file, or can I use Apache Parquet directly to serialize the data and write it out as a Parquet file?

What I've done:

try (BufferAllocator allocator = new RootAllocator()) {
    Field name = new Field("name", FieldType.nullable(new ArrowType.Utf8()), null);
    Field age = new Field("age", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Schema schemaPerson = new Schema(asList(name, age));
    try(
        VectorSchemaRoot vectorSchemaRoot = VectorSchemaRoot.create(schemaPerson, allocator)
    ){
        VarCharVector nameVector = (VarCharVector) vectorSchemaRoot.getVector("name");
        nameVector.allocateNew(3);
        nameVector.set(0, "David".getBytes());
        nameVector.set(1, "Gladis".getBytes());
        nameVector.set(2, "Juan".getBytes());
        IntVector ageVector = (IntVector) vectorSchemaRoot.getVector("age");
        ageVector.allocateNew(3);
        ageVector.set(0, 10);
        ageVector.set(1, 20);
        ageVector.set(2, 30);
        vectorSchemaRoot.setRowCount(3);
        File file = new File("random_access_to_file.arrow");
        try (
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            ArrowFileWriter writer = new ArrowFileWriter(vectorSchemaRoot, null, fileOutputStream.getChannel())
        ) {
            writer.start();
            writer.writeBatch();
            writer.end();
            System.out.println("Record batches written: " + writer.getRecordBlocks().size() + ". Number of rows written: " + vectorSchemaRoot.getRowCount());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

But this outputs an Arrow file, not a Parquet file. Any ideas on how I can output a Parquet file instead? And do I need Arrow to generate a Parquet file to begin with, or can I just use Parquet directly?

Upvotes: 3

Views: 4613

Answers (2)

Ahmed Riza

Reputation: 53

The tests in the Arrow Java implementation use the parquet-hadoop libraries, as can be seen from the POM. This is a bit unfortunate at the moment, since parquet-hadoop depends on Hadoop libraries such as hadoop-common, which is notorious for its large dependency chain (and a lot of CVEs).

Even the latest version of hadoop-common has 15 CVEs. Other language implementations of Arrow (such as C++ or Rust) do not require Hadoop, so they can be used for easier Parquet integration.

Upvotes: 1

L. Blanc

Reputation: 2310

Arrow Java does not yet support writing Parquet files, but you can use the Parquet Java library directly to do that.

There is some code in the Arrow dataset test classes that may help. See

org.apache.arrow.dataset.ParquetWriteSupport;
org.apache.arrow.dataset.file.TestFileSystemDataset; 

The second class has some tests that use the utilities in the first one.

You can find them on GitHub here: https://github.com/apache/arrow/tree/master/java/dataset/src/test/java/org/apache/arrow/dataset
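
For reference, here is a minimal sketch of writing the same three rows with Parquet directly, without going through Arrow at all. It assumes parquet-avro and the Hadoop client libraries are on the classpath. The class name ParquetDirectWriteSketch, the helper writePeople, and the output file name people.parquet are made up for illustration. It uses AvroParquetWriter from parquet-avro; newer Parquet releases may prefer an OutputFile-based builder over the Hadoop Path one shown here.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;

import java.io.IOException;

// Hypothetical class name; not part of Arrow or Parquet.
public class ParquetDirectWriteSketch {

    // Avro schema mirroring the question's Arrow schema:
    // a nullable string "name" and a nullable 32-bit int "age".
    static final Schema PERSON = SchemaBuilder.record("Person").fields()
            .optionalString("name")
            .optionalInt("age")
            .endRecord();

    // Writes three sample rows to outputPath and returns the row count.
    static int writePeople(String outputPath) throws IOException {
        String[] names = {"David", "Gladis", "Juan"};
        int[] ages = {10, 20, 30};

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path(outputPath))
                .withSchema(PERSON)
                // Replace an existing file instead of failing on it.
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .build()) {
            // Parquet's Java writer is record-oriented: one record per row.
            for (int i = 0; i < names.length; i++) {
                GenericRecord person = new GenericData.Record(PERSON);
                person.put("name", names[i]);
                person.put("age", ages[i]);
                writer.write(person);
            }
        }
        return names.length;
    }

    public static void main(String[] args) throws IOException {
        int rows = writePeople("people.parquet");
        System.out.println("Rows written: " + rows);
    }
}
```

If the data already lives in Arrow vectors, the same loop can read each row's values from the vectors (for example with getObject(i)) and put them into the record; that conversion is not shown here.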

Upvotes: 3
