John Hurst

Reputation: 11

Structure of BigQuery nested arrays with AVRO or Parquet

I am trying to load Parquet data into Google BigQuery, both to take advantage of the efficient columnar format and (I hope) to get around BigQuery's lack of support for logical types (DATE, etc.) in AVRO files.

My data contain two levels of nested arrays.

Using JSON I can create and load the table with the desired structure:

bq mk temp.simple_interval simple_interval_bigquery_schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON temp.simple_interval ~/Desktop/simple_interval.json
bq show temp.simple_interval

   Last modified                    Schema                   Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- ---------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:21:56   |- file_name: string (required)          3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (repeated)
                    |  |- interval_date: date (required)
                    |  |- quality: string (required)
                    |  +- values: record (repeated)
                    |  |  |- interval: integer (required)
                    |  |  |- value: float (required)
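For reference, a simple_interval_bigquery_schema.json that yields this structure looks like the following (a sketch reconstructed from the bq show output above):

[
  {"name": "file_name", "type": "STRING", "mode": "REQUIRED"},
  {"name": "file_created", "type": "TIMESTAMP", "mode": "REQUIRED"},
  {"name": "id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "interval_length", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "days", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "interval_date", "type": "DATE", "mode": "REQUIRED"},
    {"name": "quality", "type": "STRING", "mode": "REQUIRED"},
    {"name": "values", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "interval", "type": "INTEGER", "mode": "REQUIRED"},
      {"name": "value", "type": "FLOAT", "mode": "REQUIRED"}
    ]}
  ]}
]

A matching row in simple_interval.json would look like this (values are illustrative):

{"file_name": "example", "file_created": "2018-05-09T13:00:00Z", "id": "id-1", "interval_length": 30, "days": [{"interval_date": "2018-05-09", "quality": "GOOD", "values": [{"interval": 1, "value": 1.5}]}]}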

I have tried to create the same structure with a Parquet data file using AvroParquetWriter. My AVRO schema is:

{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {
      "type": "array",
      "items": {
        "name": "days_record",
        "type": "record",
        "fields": [
          {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
          {"name": "quality", "type": "string"},
          {"name": "values", "type": {
            "type": "array",
            "items": {
              "name": "values_record",
              "type": "record",
              "fields": [
                {"name": "interval", "type": "int"},
                {"name": "value", "type": "float"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}
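For context, here is a minimal sketch of how such a file can be written with AvroParquetWriter (the schema file name and record values are illustrative, not from my actual code):

import java.io.File;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteSimpleInterval {
    public static void main(String[] args) throws Exception {
        // Parse the AVRO schema above, saved as simple_interval.avsc (name assumed).
        Schema schema = new Schema.Parser().parse(new File("simple_interval.avsc"));
        Schema daysSchema = schema.getField("days").schema().getElementType();
        Schema valuesSchema = daysSchema.getField("values").schema().getElementType();

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("simple_interval.parquet"))
                .withSchema(schema)
                .build()) {

            GenericRecord value = new GenericData.Record(valuesSchema);
            value.put("interval", 1);
            value.put("value", 1.5f);

            GenericRecord day = new GenericData.Record(daysSchema);
            day.put("interval_date", 17660);   // logicalType date: days since epoch
            day.put("quality", "GOOD");
            day.put("values", Arrays.asList(value));

            GenericRecord row = new GenericData.Record(schema);
            row.put("file_name", "example");
            row.put("file_created", System.currentTimeMillis()); // timestamp-millis
            row.put("id", "id-1");
            row.put("interval_length", 30);
            row.put("days", Arrays.asList(day));

            writer.write(row);
        }
    }
}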

From the AVRO specification, and from what I have found online, it seems it is necessary to nest 'record' nodes inside 'array' nodes in this way.

When I create my Parquet file, Parquet tools reports the schema as:

message simple_interval {
  required binary file_name (UTF8);
  required int64 file_created (TIMESTAMP_MILLIS);
  required binary id (UTF8);
  required int32 interval_length;
  required group days (LIST) {
    repeated group array {
      required int32 interval_date (DATE);
      required binary quality (UTF8);
      required group values (LIST) {
        repeated group array {
          required int32 interval;
          required float value;
        }
      }
    }
  }
}
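(This is the output of inspecting the file with the parquet-tools CLI, e.g.:

parquet-tools schema ~/Desktop/simple_interval.parquet

)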

I load the file into BigQuery and examine the results:

bq load --source_format=PARQUET temp.simple_interval ~/Desktop/simple_interval.parquet
bq show temp.simple_interval

   Last modified                      Schema                      Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- --------------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:05:54   |- file_name: string (required)               3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (required)
                    |  +- array: record (repeated)           <-- extra column
                    |  |  |- interval_date: date (required)
                    |  |  |- quality: string (required)
                    |  |  +- values: record (required)
                    |  |  |  +- array: record (repeated)     <-- extra column
                    |  |  |  |  |- interval: integer (required)
                    |  |  |  |  |- value: float (required)

This is workable, but is there a way to avoid the extra intermediate 'array' nodes/columns?

Have I missed something? Is there a way, with AVRO/Parquet, to obtain the same simpler BigQuery table structure for nested arrays that I get with JSON?

Upvotes: 1

Views: 2719

Answers (1)

VictorGGl

Reputation: 1916

I used this AVRO schema:

{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {"type": "record", "name": "days_", "fields": [
      {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
      {"name": "quality", "type": "string"},
      {"name": "values", "type": {"type": "record", "name": "values_", "fields": [
        {"name": "interval", "type": "int"},
        {"name": "value", "type": "float"}
      ]}}
    ]}}
  ]
}

I created an empty AVRO file from this schema and ran the command:

bq load --source_format=AVRO <dataset>.<table-name> <avro-file>.avro 
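(An empty AVRO container file like this can be produced with the Avro Java API; a minimal sketch, with illustrative file names:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteEmptyAvro {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("simple_interval.avsc"));
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // Create the container file carrying only the schema; append no records.
            writer.create(schema, new File("simple_interval.avro"));
        }
    }
}

)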

When running bq show <dataset>.<table-name> I get the following:

 Last modified                    Schema                    Total Rows   Total Bytes   Expiration   Time Partitioning   Labels   kmsKeyName  
 ----------------- ----------------------------------------- ------------ ------------- ------------ ------------------- -------- ------------ 
  22 May 09:46:02   |- file_name: string (required)           0            0                                                                   
                    |- file_created: integer (required)                                                                                        
                    |- id: string (required)                                                                                                   
                    |- interval_length: integer (required)                                                                                     
                    +- days: record (required)                                                                                                 
                    |  |- interval_date: integer (required)                                                                                    
                    |  |- quality: string (required)                                                                                           
                    |  +- values: record (required)                                                                                            
                    |  |  |- interval: integer (required)                                                                                      
                    |  |  |- value: float (required)      

Upvotes: 0
