Reputation: 447
I am learning Parquet's internal file representation, so I went through Apache Parquet's GitHub page, Google's Dremel paper (to understand definition and repetition levels), and Twitter's blog post to learn more about the Parquet file format.
To relate what I learned from that reading to the representation of an actual Parquet file, I ran the parquet-tools command with the meta option on a sample Parquet file. It printed details in 3 major sections: Header, File schema and Row_groups. I understood the details in the first 2 sections, but I couldn't completely understand everything in the row group section.
Below are the questions that I have.
1. What are DO, FPO and VC (the last one looks like the count of all the rows in the current row group)? The expansion of each abbreviation can be found on the parquet-tools GitHub page, but I wanted more details. I understand what SZ and ST are.
2. Under ENC I see a list of encoding schemes like BIT_PACKED, PLAIN, RLE. I understand what each one means individually, but I do not understand why at least 3 encoding schemes are always used.
3. Along with the row count RC and total size TS of the row group, I see OFFSET. For the first page it is always 4. How is it calculated?
Unfortunately I couldn't attach a snippet of the parquet-tools meta command's output due to security constraints, but I hope it is not too hard to visualize what I mean in each of the questions.
Upvotes: 5
Views: 3462
Reputation: 11
I know this is an old question, but since this still isn't documented anywhere, I found the following helpful.
In org.apache.parquet:parquet-cli:org/apache/parquet/cli/Util.java:
public static String encodingsAsString(Set<Encoding> encodings, ColumnDescriptor desc) {
  StringBuilder sb = new StringBuilder();
  if (encodings.contains(RLE) || encodings.contains(BIT_PACKED)) {
    sb.append(desc.getMaxDefinitionLevel() == 0 ? "B" : "R");
    sb.append(desc.getMaxRepetitionLevel() == 0 ? "B" : "R");
    if (encodings.contains(PLAIN_DICTIONARY)) {
      sb.append("R");
    }
    if (encodings.contains(PLAIN)) {
      sb.append("_");
    }
  } else {
    sb.append("RR");
    if (encodings.contains(RLE_DICTIONARY)) {
      sb.append("R");
    }
    if (encodings.contains(PLAIN)) {
      sb.append("_");
    }
    if (encodings.contains(DELTA_BYTE_ARRAY)
        || encodings.contains(DELTA_BINARY_PACKED)
        || encodings.contains(DELTA_LENGTH_BYTE_ARRAY)) {
      sb.append("D");
    }
  }
  return sb.toString();
}
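To see what that method produces for a given column, here is a rough Python port of the logic above. It is only a sketch: the function name, the plain-string encoding names and the two level arguments are my own stand-ins for the Java `Set<Encoding>` and `ColumnDescriptor`, not part of parquet-cli.

```python
# Hypothetical Python port of Util.encodingsAsString, for illustration only.
# Encodings are passed as plain strings; the two max levels replace the
# ColumnDescriptor argument of the Java method.
DELTA_ENCODINGS = {"DELTA_BYTE_ARRAY", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY"}


def encodings_as_string(encodings, max_definition_level, max_repetition_level):
    enc = set(encodings)
    parts = []
    if "RLE" in enc or "BIT_PACKED" in enc:
        # One character for definition levels, one for repetition levels.
        parts.append("B" if max_definition_level == 0 else "R")
        parts.append("B" if max_repetition_level == 0 else "R")
        if "PLAIN_DICTIONARY" in enc:
            parts.append("R")
        if "PLAIN" in enc:
            parts.append("_")
    else:
        parts.append("RR")
        if "RLE_DICTIONARY" in enc:
            parts.append("R")
        if "PLAIN" in enc:
            parts.append("_")
        if enc & DELTA_ENCODINGS:
            parts.append("D")
    return "".join(parts)


# An optional, non-repeated column (max definition level 1, max repetition
# level 0) written with PLAIN_DICTIONARY, RLE and PLAIN:
print(encodings_as_string(["PLAIN_DICTIONARY", "RLE", "PLAIN"], 1, 0))  # RBR_
```

Reading the method this way, the first two characters describe the definition and repetition levels and the rest describe the values, which may be part of why a single column chunk lists several encodings at once.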
Upvotes: 1
Reputation: 341
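This page has the best description I found: https://github.com/apache/parquet-mr/tree/master/parquet-tools-deprecated
So it seems that DO and FPO are just offset information showing where the data of a particular column starts in the file: DO is the dictionary page offset and FPO is the first data page offset. VC is the value count of the column chunk. As far as I can tell it counts null entries as well, since nulls are reported separately as num_nulls under ST.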
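On the offset of 4: every Parquet file starts (and ends) with the 4-byte magic PAR1, so byte 4 is the first position where column data can begin. That would explain why OFFSET and the first DO are 4 in the output further down. A minimal sketch to check this on a file, where the function name is mine and the file path is the one from the pandas example below:

```python
import os

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at both ends of a Parquet file
FIRST_PAYLOAD_OFFSET = len(PARQUET_MAGIC)  # first byte after the leading magic


def magic_check(path):
    """Return (starts_with_magic, ends_with_magic) for the file at path."""
    n = len(PARQUET_MAGIC)
    with open(path, "rb") as f:
        head = f.read(n)
        # Jump to the last n bytes (or to 0 for files shorter than n bytes).
        f.seek(max(os.path.getsize(path) - n, 0))
        tail = f.read(n)
    return head == PARQUET_MAGIC, tail == PARQUET_MAGIC


# Only runs if the example file from this answer exists locally.
if __name__ == "__main__" and os.path.exists("./test_pandas.lz4.parquet"):
    print(magic_check("./test_pandas.lz4.parquet"), FIRST_PAYLOAD_OFFSET)
```

This only inspects the magic bytes; it does not parse the footer, so it says nothing about the offsets of later row groups.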
Parquet file creation with pandas
import pandas as pd
df = pd.DataFrame({
'w1': ["John", "Max", "Hans"],
'w2': ["Doe", "Mustermann", "Peter"],
'w3': ["New York", "Berlin", "München"],
'w4': [1990, 1980, 1970]})
df.to_parquet('./test_pandas.lz4.parquet', compression="lz4")
Meta output of parquet-tools.jar with java -jar ./parquet-tools-1.10.1.jar meta <file>
file schema: schema
--------------------------------------------------------------------------------
w1: OPTIONAL BINARY O:UTF8 R:0 D:1
w2: OPTIONAL BINARY O:UTF8 R:0 D:1
w3: OPTIONAL BINARY O:UTF8 R:0 D:1
w4: OPTIONAL INT64 R:0 D:1
row group 1: RC:3 TS:440 OFFSET:4
--------------------------------------------------------------------------------
w1: BINARY LZ4 DO:4 FPO:51 SZ:98/79/0.81 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN ST:[min: Hans, max: Max, num_nulls: 0]
w2: BINARY LZ4 DO:165 FPO:219 SZ:106/87/0.82 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN ST:[min: Doe, max: Peter, num_nulls: 0]
w3: BINARY LZ4 DO:337 FPO:394 SZ:115/97/0.84 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN ST:[min: Berlin, max: New York, num_nulls: 0]
w4: INT64 LZ4 DO:524 FPO:565 SZ:121/109/0.90 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN ST:[min: 1970, max: 1990, num_nulls: 0]
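The numbers in that output can be cross-checked with a few lines of Python. This is just a sketch that parses the four column lines copied from above (ST omitted); the regular expression and the dictionary keys are my own naming, and I am deliberately not claiming which of the two SZ numbers is the compressed size.

```python
import re

# Column lines copied from the meta output above (ST part dropped).
META_LINES = """\
w1: BINARY LZ4 DO:4 FPO:51 SZ:98/79/0.81 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN
w2: BINARY LZ4 DO:165 FPO:219 SZ:106/87/0.82 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN
w3: BINARY LZ4 DO:337 FPO:394 SZ:115/97/0.84 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN
w4: INT64 LZ4 DO:524 FPO:565 SZ:121/109/0.90 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN
"""

FIELD_RE = re.compile(r"DO:(\d+) FPO:(\d+) SZ:(\d+)/(\d+)/([\d.]+) VC:(\d+)")


def parse_columns(text):
    """Extract the numeric fields of each column line into a list of dicts."""
    columns = []
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match is None:
            continue
        do, fpo, sz_first, sz_second, ratio, vc = match.groups()
        columns.append({
            "name": line.split(":", 1)[0],
            "do": int(do),
            "fpo": int(fpo),
            "sz_first": int(sz_first),
            "sz_second": int(sz_second),
            "ratio": float(ratio),
            "vc": int(vc),
        })
    return columns


columns = parse_columns(META_LINES)

# The first SZ number of every column adds up to the row group's TS value.
print(sum(c["sz_first"] for c in columns))  # 440

# The third SZ number is the second divided by the first, rounded to 2 digits.
for c in columns:
    print(c["name"], round(c["sz_second"] / c["sz_first"], 2), c["ratio"])
```

In this file the dictionary page offset (DO) is smaller than the first data page offset (FPO) for every column, which fits the idea that each column chunk starts with its dictionary page.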
Upvotes: 0