Reputation: 91
I'm using the Parquet C++ library (parquet-cpp) to write data from a MySQL database to a Parquet file. I have two questions:
1) What does the REPETITION in the schema mean? Is it related to table constraints, as when we define a column as NULL or NOT NULL?
2) How do I insert a NULL value into a column? Do I just pass a null pointer to the values parameter?
WriteBatch(int64_t num_levels, const int16_t* def_levels,
const int16_t* rep_levels,
const typename ParquetType::c_type* values)
Thanks in advance!
Upvotes: 2
Views: 4194
Reputation: 11
@Ivy.W I have been using parquet-cpp recently at work, and this is what I understood.
The Parquet schema needs to know about each column of the table that you are going to read from and write to. If the column is nullable, its repetitionType is optional; if it is not nullable, its repetitionType is required. The third type, repeated, is for nested structures such as maps and lists. Let me give a quick intro to definition and repetition levels:
The definition level in Parquet tells the writer whether the value is null and, in a nested schema, at which level of the field's path the NULL occurs. Together, the definition and repetition levels make it possible to reconstruct the original nested records from the flat column data.
A field can be optional, required, or repeated. If the field is required, it can't be null, so no definition level is needed. If it is a top-level optional field, the definition level is 0 for null and 1 for non-null. If the schema is nested, the maximum definition level grows with the number of optional or repeated fields on the path.
For example:
message ExampleDefinitionLevel {
optional group a {
optional group b {
optional string c;
}
}
}
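Here the definition level of c counts how many optional fields on the path a.b.c are defined: 0 if a is null, 1 if a is defined but b is null, 2 if a and b are defined but c is null, and 3 if c actually holds a value.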
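To make the counting concrete, here is a minimal sketch that models the schema above with std::optional and computes the definition level that column c would get for one record. The struct names and DefinitionLevelOfC are hypothetical helpers for illustration only, not part of parquet-cpp, and the sketch assumes C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical in-memory mirror of ExampleDefinitionLevel:
// every optional field is modelled with std::optional.
struct B { std::optional<std::string> c; };
struct A { std::optional<B> b; };
struct Record { std::optional<A> a; };

// Returns how many optional fields on the path a.b.c are defined.
int16_t DefinitionLevelOfC(const Record& r) {
  if (!r.a) return 0;        // a is null
  if (!r.a->b) return 1;     // a is defined, b is null
  if (!r.a->b->c) return 2;  // a and b are defined, c is null
  return 3;                  // c holds a value
}
```

Only a record that reaches level 3 contributes an entry to the values array; the lower levels just record where the null occurred.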
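Repetition level: The repetition level only applies to repeated fields, i.e. nested structures such as lists and maps. It marks where a new list starts: 0 means the value begins a new record, and a higher level means it continues the current list. For example, when a user can have multiple phone numbers, the field will be "repeated":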
message list{
repeated string list;
}
The data would be like: ["a","b","c"]
and would look like:
{
list:"a",
list:"b",
list:"c"
}
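As a rough sketch of how levels could be derived for the schema above, the helper below turns a batch of records (each a list of strings) into definition levels, repetition levels, and values. Levels and EncodeRepeated are hypothetical names, not parquet-cpp API; the sketch assumes a single top-level repeated field, so both maximum levels are 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container for the three arrays a column writer needs.
struct Levels {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  std::vector<std::string> values;
};

// Encodes records for a single top-level `repeated string` field.
Levels EncodeRepeated(const std::vector<std::vector<std::string>>& records) {
  Levels out;
  for (const auto& rec : records) {
    if (rec.empty()) {
      // Empty list: one placeholder entry, no value stored.
      out.def_levels.push_back(0);
      out.rep_levels.push_back(0);
      continue;
    }
    for (std::size_t i = 0; i < rec.size(); ++i) {
      out.def_levels.push_back(1);
      // 0 starts a new record, 1 continues the current list.
      out.rep_levels.push_back(i == 0 ? 0 : 1);
      out.values.push_back(rec[i]);
    }
  }
  return out;
}
```

For the record ["a","b","c"] this yields repetition levels 0, 1, 1 and definition levels 1, 1, 1.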
To write a null, make sure the schema declares the column as optional and pass a definition level of 0 for that entry; Parquet's WriteBatch should take care of the rest.
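A minimal sketch of preparing the arguments for a flat optional column, assuming an INT32 column and C++17: one definition level per row, and a values array that holds only the non-null entries. Batch and BuildOptionalBatch are hypothetical helpers, not part of parquet-cpp; the WriteBatch call is shown only in a comment, using the signature from the question.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical holder for the arrays handed to WriteBatch.
struct Batch {
  std::vector<int16_t> def_levels;  // one entry per row
  std::vector<int32_t> values;      // non-null rows only
};

// Builds levels and values for a top-level optional INT32 column.
Batch BuildOptionalBatch(const std::vector<std::optional<int32_t>>& column) {
  Batch out;
  for (const auto& cell : column) {
    if (cell) {
      out.def_levels.push_back(1);  // value present
      out.values.push_back(*cell);
    } else {
      out.def_levels.push_back(0);  // NULL: no value is stored
    }
  }
  return out;
}

// Assumed usage with a column writer obtained elsewhere; the column is
// not repeated, so rep_levels is passed as nullptr:
//   writer->WriteBatch(batch.def_levels.size(), batch.def_levels.data(),
//                      nullptr, batch.values.data());
```

With this layout a row is marked null purely by its definition level, so the values array for {1, NULL, 3} contains just 1 and 3.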
Please refer to https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
Upvotes: 1