user1154422

Reputation: 656

How to optimize the C++ Parquet ReadBatch method

I want to optimize the reading of a column in Parquet using the ReadBatch method.

You pass in the number of rows to read:

int64_t cnt = reader->ReadBatch(10, nullptr, nullptr, &value, &values_read);

In this case I am asking for 10 rows, and the return value is the number actually read.

Is there a way to get the number of rows in the Row Group before the read?

Upvotes: 0

Views: 364

Answers (1)

user1154422

Reputation: 656

Use the metadata() method on the ParquetFileReader or the RowGroupReader to get the number of rows:

  // Total rows in the Parquet file (num_rows() returns int64_t)
  std::unique_ptr<parquet::ParquetFileReader> parquet_reader = ...;
  std::shared_ptr<parquet::FileMetaData> file_metadata = parquet_reader->metadata();
  int64_t total_num_rows = file_metadata->num_rows();

  // Rows in a specific row group
  std::shared_ptr<parquet::RowGroupReader> row_group_reader = ...;
  auto rgMetaData = row_group_reader->metadata();
  int64_t rowGroupNumRows = rgMetaData->num_rows();
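
Once the row count is known, one way to use it is to size the output buffer up front and keep calling ReadBatch until the whole column chunk has been read, since a single call can return fewer values than requested. The sketch below is only an illustration under some assumptions: the column is REQUIRED and non-nested (so no definition or repetition levels are needed), and `ReadColumnChunk` and `FakeInt64Reader` are hypothetical names that are not part of the Parquet library. The fake reader only mimics the `HasNext()` / `ReadBatch()` shape of `parquet::TypedColumnReader` so the example compiles without Arrow.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in (NOT a Parquet class): mimics the HasNext()/ReadBatch()
// shape of parquet::TypedColumnReader so this sketch builds without Arrow.
struct FakeInt64Reader {
  std::vector<int64_t> data;   // pretend column contents
  int64_t pos = 0;             // index of the next unread value
  int64_t max_per_call = 4;    // simulate short reads, e.g. at page boundaries

  bool HasNext() const { return pos < static_cast<int64_t>(data.size()); }

  // Same parameter order as ReadBatch: batch size, def levels, rep levels,
  // output buffer, out-parameter for the number of values written.
  int64_t ReadBatch(int64_t batch_size, int16_t* /*def_levels*/,
                    int16_t* /*rep_levels*/, int64_t* values,
                    int64_t* values_read) {
    const int64_t remaining = static_cast<int64_t>(data.size()) - pos;
    const int64_t n = std::min({batch_size, max_per_call, remaining});
    std::copy(data.begin() + pos, data.begin() + pos + n, values);
    pos += n;
    *values_read = n;
    return n;
  }
};

// Hypothetical helper: reads up to num_rows values into one buffer.
// Assumes a REQUIRED, non-nested column, so level pointers are nullptr.
template <typename T, typename Reader>
std::vector<T> ReadColumnChunk(Reader& reader, int64_t num_rows) {
  std::vector<T> values(static_cast<std::size_t>(num_rows));
  int64_t total = 0;
  while (total < num_rows && reader.HasNext()) {
    int64_t values_read = 0;
    // Ask for everything that is still missing; the reader may return less.
    reader.ReadBatch(num_rows - total, nullptr, nullptr,
                     values.data() + total, &values_read);
    if (values_read == 0) break;  // no progress: stop instead of spinning
    total += values_read;
  }
  values.resize(static_cast<std::size_t>(total));
  return values;
}
```

With the real library, the same helper could presumably be called as `ReadColumnChunk<int64_t>(*int64_reader, rowGroupNumRows)`, where `int64_reader` is the `parquet::Int64Reader*` obtained by casting the result of `row_group_reader->Column(i)`. For optional or nested columns the level buffers would also have to be passed and the loop adjusted.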

Upvotes: 1
