
Reputation: 447

Understanding Parquet File's metadata information printed with parquet-tools "meta" command

I am learning the internal representation of Parquet files, so I went through Apache Parquet's GitHub page, Google's Dremel paper (to understand definition and repetition levels), and Twitter's blog to learn more about the Parquet file format.

To relate what I read to an actual Parquet file, I ran the parquet-tools command with the meta option on a sample Parquet file. It printed details in 3 major sections: Header, File schema and Row_groups. I understood the details in the first 2 sections, but I couldn't completely understand everything in the row group section.

Below are the questions that I have.

I want to know more about what DO, FPO and VC are (VC looks like the count of all the rows in the current row group). What the abbreviations stand for can be found on the parquet-tools GitHub page, but I want more detail. I understand what SZ and ST are.
Next to ENC I see a list of encoding schemes like BIT_PACKED, PLAIN, RLE. I understand what each one means individually, but I do not understand why at least 3 encoding schemes are always used.
Next to the record count RC and total size TS of the row group, I see OFFSET. For the first page it is always 4. How is it calculated?
I learned that a Parquet file's header and footer contain the 4-character magic code "PAR1". Does it have any special meaning, or is it just arbitrary text used to decide whether the file is Parquet (without depending on the file extension)?

Unfortunately I couldn't attach a snippet of the parquet-tools meta command's output due to security constraints, but I hope it is not too hard to visualize what I mean in each of the questions.

Upvotes: 5

Answers (2)

Anton_Chigur

Reputation: 11

I know this is an old question, but since this still isn't documented anywhere, I found the following helpful.

In org.apache.parquet:parquet-cli:org/apache/parquet/cli/Util.java

    public static String encodingsAsString(Set<Encoding> encodings, ColumnDescriptor desc) {
      StringBuilder sb = new StringBuilder();
      if (encodings.contains(RLE) || encodings.contains(BIT_PACKED)) {
        sb.append(desc.getMaxDefinitionLevel() == 0 ? "B" : "R");
        sb.append(desc.getMaxRepetitionLevel() == 0 ? "B" : "R");
        if (encodings.contains(PLAIN_DICTIONARY)) {
          sb.append("R");
        }
        if (encodings.contains(PLAIN)) {
          sb.append("_");
        }
      } else {
        sb.append("RR");
        if (encodings.contains(RLE_DICTIONARY)) {
          sb.append("R");
        }
        if (encodings.contains(PLAIN)) {
          sb.append("_");
        }
        if (encodings.contains(DELTA_BYTE_ARRAY)
            || encodings.contains(DELTA_BINARY_PACKED)
            || encodings.contains(DELTA_LENGTH_BYTE_ARRAY)) {
          sb.append("D");
        }
      }
      return sb.toString();
    }
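
To try the abbreviations out by hand without building parquet-cli, here is a rough Python transliteration of the method above. It is only a sketch: the function name `encodings_as_string` and the use of plain strings for encoding names are my own choices for illustration, not part of any Parquet library.

```python
def encodings_as_string(encodings, max_definition_level, max_repetition_level):
    """Mimic the branching of parquet-cli's Util.encodingsAsString with plain strings."""
    encodings = set(encodings)
    parts = []
    if "RLE" in encodings or "BIT_PACKED" in encodings:
        # First character depends on the max definition level,
        # second character on the max repetition level.
        parts.append("B" if max_definition_level == 0 else "R")
        parts.append("B" if max_repetition_level == 0 else "R")
        if "PLAIN_DICTIONARY" in encodings:
            parts.append("R")
        if "PLAIN" in encodings:
            parts.append("_")
    else:
        parts.append("RR")
        if "RLE_DICTIONARY" in encodings:
            parts.append("R")
        if "PLAIN" in encodings:
            parts.append("_")
        delta = {"DELTA_BYTE_ARRAY", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY"}
        if encodings & delta:
            parts.append("D")
    return "".join(parts)


# An optional, non-repeated column (max definition level 1, max repetition level 0)
# that reports the encodings PLAIN_DICTIONARY, RLE and PLAIN.
print(encodings_as_string({"PLAIN_DICTIONARY", "RLE", "PLAIN"}, 1, 0))  # RBR_
```

So a column that lists three encodings is summarized as four characters: one each derived from the definition and repetition levels, one for the dictionary, and `_` for PLAIN.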

Upvotes: 1

wobu

Reputation: 341

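This page has the best description I found: https://github.com/apache/parquet-mr/tree/master/parquet-tools-deprecated

So it seems that DO and FPO are just offset information: DO is the dictionary page offset and FPO is the first data page offset, i.e. where the pages of that particular column start in the file. VC is the value count of the column. As far as I can tell from the format definition, this count includes nulls; the null count is reported separately as num_nulls under ST.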
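
On the OFFSET and "PAR1" parts of the question: per the Parquet format specification, a file starts with the 4-byte magic `PAR1` and ends with a 4-byte little-endian footer length followed by `PAR1` again, so the first page cannot start before byte 4. The sketch below checks that framing using only the Python standard library. `inspect_parquet_framing` is a hypothetical helper name, and the demo feeds it a synthetic byte string, not a real Parquet file.

```python
import io
import struct

MAGIC = b"PAR1"


def inspect_parquet_framing(stream):
    """Return (first_possible_data_offset, footer_length) for a seekable binary stream.

    Raises ValueError if the leading or trailing magic bytes are missing.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < 12:  # leading magic + footer length + trailing magic
        raise ValueError("too small to be a Parquet file")
    stream.seek(0)
    head = stream.read(4)
    stream.seek(size - 8)
    tail = stream.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    footer_length = struct.unpack("<I", tail[:4])[0]
    # Data can only begin right after the leading magic, i.e. at byte 4.
    return len(MAGIC), footer_length


# Synthetic stand-in for a file: magic, dummy "pages", dummy "footer",
# footer length, magic. The dummy bytes are NOT valid Parquet content.
fake_pages = b"\x00" * 10
fake_footer = b"\x01" * 6
blob = MAGIC + fake_pages + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

print(inspect_parquet_framing(io.BytesIO(blob)))  # (4, 6)
```

To run it on a real file, pass `open(path, "rb")` in place of the `io.BytesIO` object.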

Parquet file creation with pandas

import pandas as pd

df = pd.DataFrame({
    'w1': ["John", "Max", "Hans"],
    'w2': ["Doe", "Mustermann", "Peter"],
    'w3': ["New York", "Berlin", "München"],
    'w4': [1990, 1980, 1970]})


df.to_parquet('./test_pandas.lz4.parquet', compression="lz4")

Meta output of parquet-tools.jar with java -jar ./parquet-tools-1.10.1.jar meta <file>

file schema: schema
--------------------------------------------------------------------------------
w1:          OPTIONAL BINARY O:UTF8 R:0 D:1
w2:          OPTIONAL BINARY O:UTF8 R:0 D:1
w3:          OPTIONAL BINARY O:UTF8 R:0 D:1
w4:          OPTIONAL INT64 R:0 D:1

row group 1: RC:3 TS:440 OFFSET:4
--------------------------------------------------------------------------------
w1:           BINARY LZ4 DO:4 FPO:51 SZ:98/79/0.81 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN ST:[min: Hans, max: Max, num_nulls: 0]
w2:           BINARY LZ4 DO:165 FPO:219 SZ:106/87/0.82 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN ST:[min: Doe, max: Peter, num_nulls: 0]
w3:           BINARY LZ4 DO:337 FPO:394 SZ:115/97/0.84 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN ST:[min: Berlin, max: New York, num_nulls: 0]
w4:           INT64 LZ4 DO:524 FPO:565 SZ:121/109/0.90 VC:3 ENC:PLAIN_DICTIONARY,RLE,PLAIN ST:[min: 1970, max: 1990, num_nulls: 0]
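
The numbers in this output can be cross-checked with a little arithmetic. The sketch below parses trimmed copies of the four column lines (ENC and ST removed) with a regular expression and confirms three relationships visible in this sample: the third SZ figure is the second divided by the first, TS:440 equals the sum of the first SZ figures, and each column's DO comes before its FPO. The pattern and the `parse_column_line` helper are my own, written only against this sample output; they are not a parquet-tools API.

```python
import re

# Trimmed copies of the column lines printed above.
META_LINES = """\
w1: BINARY LZ4 DO:4 FPO:51 SZ:98/79/0.81 VC:3
w2: BINARY LZ4 DO:165 FPO:219 SZ:106/87/0.82 VC:3
w3: BINARY LZ4 DO:337 FPO:394 SZ:115/97/0.84 VC:3
w4: INT64 LZ4 DO:524 FPO:565 SZ:121/109/0.90 VC:3
"""

PATTERN = re.compile(
    r"(?P<name>\w+):\s+(?P<type>\w+)\s+(?P<codec>\w+)\s+"
    r"DO:(?P<do>\d+)\s+FPO:(?P<fpo>\d+)\s+"
    r"SZ:(?P<sz1>\d+)/(?P<sz2>\d+)/(?P<ratio>[\d.]+)\s+VC:(?P<vc>\d+)"
)


def parse_column_line(line):
    """Turn one trimmed column line into a dict with numeric fields converted."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    fields = match.groupdict()
    for key in ("do", "fpo", "sz1", "sz2", "vc"):
        fields[key] = int(fields[key])
    fields["ratio"] = float(fields["ratio"])
    return fields


columns = [parse_column_line(line) for line in META_LINES.splitlines()]

for col in columns:
    # The printed ratio matches sz2 / sz1 to two decimal places.
    assert abs(col["sz2"] / col["sz1"] - col["ratio"]) < 0.005
    # The dictionary page offset precedes the first data page offset.
    assert col["do"] < col["fpo"]

print(sum(col["sz1"] for col in columns))  # 440, the TS value of the row group
```

The first column's DO of 4 also matches the row group's OFFSET:4, which is consistent with the column data starting right after the 4-byte "PAR1" magic at the beginning of the file.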

Upvotes: 0
