User

Reputation: 168

Complex in-memory object to Parquet files without Apache Spark

I have an object containing several lists, and I would like to create a Parquet file for each of those lists. The object looks like this:

case class ProgramsData(programs: List[Program], switches: List[Switch], paths: List[Path],
                    activities: List[Activity], enactments: List[Enactment])

After some research, I'm a bit baffled as to how to achieve this.

It seems that the only way to do this without Apache Spark is to convert the object into an Avro file, read that back in to get the schema, and then create a Parquet file from it.

Is there a way to cut out the middleman and convert my object to a Parquet file for each list inside the object? If not, what's the simplest way to achieve my end goal? Sadly, most of the few examples I've seen either don't work or don't do exactly what I want.

Thanks in advance.

Edit: After another full day working on this, I've produced some code that generates a file. But when I wrote code to read the Parquet file back in, it reported that it cannot decode it. That suggests the writer code is wrong (though surely if it couldn't create a valid Parquet file, it would throw an exception?).

Here is the writing code:

val avroSchema: Schema = ReflectData.get().getSchema(classOf[Program])

val parquetOutputPath = new org.apache.hadoop.fs.Path(outputFilePath)

val parquetWriter = new AvroParquetWriter[Record](parquetOutputPath, avroSchema)

programs.foreach(program => {
  val programRecord = new GenericData.Record(avroSchema)
  programRecord.put("name", program.getName)

  parquetWriter.write(programRecord)
})

parquetWriter.close()

And here is the reading code:

val avroSchema: Schema = ReflectData.get().getSchema(classOf[Program])

val filePath = new org.apache.hadoop.fs.Path(filePathString)

val parquetReader = new AvroParquetReader[Record](filePath)

val record = parquetReader.read()

The last line of the reading code throws the following exception:

.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/tmp/parquet/test-1506962769.parquet

Hopefully someone can point me in the right direction, otherwise I may have to use Apache Spark just for its ability to easily create a parquet file, which is overkill.

Upvotes: 1

Views: 3109

Answers (2)

Devas

Reputation: 1694

You can write a Parquet file using an Avro schema without using Spark.

Here is sample code in Java that writes a Parquet file to local disk.

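// Imports assume the org.apache.parquet package layout of parquet-avro.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
{
final String schemaLocation = "/tmp/avro_format.json";
final Schema avroSchema = new Schema.Parser().parse(new File(schemaLocation));
final MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
// Type the write support to the record class actually written (GenericRecord), not Pojo.
final WriteSupport<GenericRecord> writeSupport = new AvroWriteSupport(parquetSchema, avroSchema);
final String parquetFile = "/tmp/parquet/data.parquet";
final Path path = new Path(parquetFile);
ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter(path, writeSupport, CompressionCodecName.SNAPPY, ParquetWriter.DEFAULT_BLOCK_SIZE, ParquetWriter.DEFAULT_PAGE_SIZE);
final GenericRecord record = new GenericData.Record(avroSchema);
record.put("id", 1);
record.put("age", 10);
record.put("name", "ABC");
record.put("place", "BCD");
parquetWriter.write(record);
parquetWriter.close();
}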

avro_format.json:

{
   "type":"record",
   "name":"Pojo",
   "namespace":"com.xx.test",
   "fields":[
      {
         "name":"id",
         "type":[
            "int",
            "null"
         ]
      },
      {
         "name":"age",
         "type":[
            "int",
            "null"
         ]
      },
      {
         "name":"name",
         "type":[
            "string",
            "null"
         ]
      },
      {
         "name":"place",
         "type":[
            "string",
            "null"
         ]
      }
   ]
}
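
Writing one such schema file by hand for each of several record types can get repetitive. As a rough sketch only, the same JSON shape could be generated from a list of field names and types using nothing but the Java standard library. `AvroSchemaSketch` and `nullableRecordSchema` are hypothetical names invented for this illustration and are not part of Avro; the sketch assumes flat records with primitive field types and does no escaping or validation of names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of Avro): builds the JSON text of an Avro
// record schema in which every field is a [type, "null"] union, mirroring
// the layout of avro_format.json.
class AvroSchemaSketch {

    // fields maps field name -> Avro primitive type name ("int", "string", ...).
    // Names and types are inserted verbatim, so they must not need JSON escaping.
    static String nullableRecordSchema(String namespace, String name, Map<String, String> fields) {
        StringJoiner fieldJson = new StringJoiner(",", "[", "]");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            fieldJson.add("{\"name\":\"" + f.getKey()
                    + "\",\"type\":[\"" + f.getValue() + "\",\"null\"]}");
        }
        return "{\"type\":\"record\",\"name\":\"" + name
                + "\",\"namespace\":\"" + namespace
                + "\",\"fields\":" + fieldJson + "}";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in declaration order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "int");
        fields.put("age", "int");
        fields.put("name", "string");
        fields.put("place", "string");
        System.out.println(nullableRecordSchema("com.xx.test", "Pojo", fields));
    }
}
```

The resulting string could be saved to a file such as /tmp/avro_format.json, or handed to an Avro schema parser as text instead of going through a file.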

Hope this helps.

Upvotes: 3

Mahesh Chand

Reputation: 3250

Yes you can do it directly. Refer

You can get an idea from here of how to write your data to Parquet. The example writes a single string column. To write more columns of different types, you first have to define the schema, which is expressed as a MessageType.
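
To make the MessageType idea a little more concrete: a Parquet schema has a textual form, `message Name { ... }`, with one declaration per column. The following is only a sketch, using the Java standard library, of producing that text for a flat record with optional columns. `MessageTypeSketch`, `parquetType`, and `messageType` are hypothetical names made up for this illustration, and the type mapping covers just a few simple cases. The output should be in the format that Parquet's `MessageTypeParser.parseMessageType` reads, but that step is not shown or tested here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Parquet): renders the textual schema
// for a flat record whose columns are all optional.
class MessageTypeSketch {

    // Maps a few simple type names to Parquet primitive type names.
    static String parquetType(String simpleType) {
        switch (simpleType) {
            case "int":    return "int32";
            case "long":   return "int64";
            case "double": return "double";
            case "string": return "binary"; // annotated with (UTF8) in messageType
            default: throw new IllegalArgumentException("unsupported type: " + simpleType);
        }
    }

    // columns maps column name -> simple type name, in declaration order.
    static String messageType(String name, Map<String, String> columns) {
        StringBuilder sb = new StringBuilder("message " + name + " {\n");
        for (Map.Entry<String, String> c : columns.entrySet()) {
            sb.append("  optional ").append(parquetType(c.getValue()))
              .append(' ').append(c.getKey());
            if (c.getValue().equals("string")) {
                sb.append(" (UTF8)"); // marks the binary column as text
            }
            sb.append(";\n");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("id", "int");
        columns.put("name", "string");
        System.out.println(messageType("Pojo", columns));
    }
}
```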

Upvotes: 1
