Shravan40

Reputation: 9888

How can I get the row view of data read from parquet file?

Example: Let's say a table named user has id, name, email, phone, and is_active as attributes, and there are thousands of users in this table. I would like to read the details of each user, one row at a time.

void ParquetReaderPlus::read_next_row(long row_group_index, long local_row_num)
{
    std::vector<int> columns_to_tabulate(this->total_row);
    for (int idx = 0; idx < this->total_row; idx++)
        columns_to_tabulate[idx] = idx;

    this->file_reader->set_num_threads(4);
    int rg = this->total_row_group;

    // Read into table as row group rather than the whole Parquet file.
    std::shared_ptr<arrow::Table> table;
    this->file_reader->ReadRowGroup(row_group_index, columns_to_tabulate, &table);
    auto rows = table->num_rows();
    //TODO
    // Now I am confused how to proceed from here
}

Any suggestions?

I'm not sure whether converting the table with ColumnarTableToVector would work here.

Upvotes: 2

Views: 1741

Answers (2)

Pace

Reputation: 43817

It's difficult to answer this question without knowing what you plan on doing with those details. A Table has a list of columns, and each column (in Arrow-C++) has a type-agnostic array of data. Since the columns are type-agnostic, there is not much you can do with them other than get the count and access the underlying bytes.

If you want to interact with the values then you will either need to know the type of a column ahead of time (and cast), have a series of different actions for each different type of data you might encounter (switch case plus cast), or interact with the values as buffers of bytes. One could probably write a complete answer for all three of those options.
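To make the first two options concrete, here is a minimal stdlib-only sketch. It does not use Arrow at all: `Column`, `id_at` and `value_to_string` are hypothetical names, and a `std::variant` of vectors stands in for a type-agnostic column. With real Arrow arrays you would inspect the column's type and cast to the matching array class instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for a type-agnostic column; this is not an Arrow type.
using Column = std::variant<std::vector<std::int64_t>,
                            std::vector<std::string>,
                            std::vector<bool>>;

// Option 1: the caller knows the type ahead of time and "casts".
// std::get throws std::bad_variant_access if that assumption is wrong.
inline std::int64_t id_at(const Column& col, std::size_t row) {
    return std::get<std::vector<std::int64_t>>(col)[row];
}

// Option 2: one action per type the column might hold (switch case plus cast).
inline std::string value_to_string(const Column& col, std::size_t row) {
    struct Visitor {
        std::size_t row;
        std::string operator()(const std::vector<std::int64_t>& v) const {
            return std::to_string(v[row]);
        }
        std::string operator()(const std::vector<std::string>& v) const {
            return v[row];
        }
        std::string operator()(const std::vector<bool>& v) const {
            return v[row] ? "true" : "false";
        }
    };
    return std::visit(Visitor{row}, col);
}
```

Option 1 is shorter but only works when the schema is fixed. Option 2 costs one branch per supported type and needs a new branch whenever a new type shows up.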

You might want to read up a bit on the Arrow compute API (https://arrow.apache.org/docs/cpp/compute.html although the documentation is a bit sparse for C++). This API allows you to perform some common operations on your data (somewhat) regardless of type. For example, I see the word "tabulate" in your code snippet. If you wanted to sum up the values in a column then you could use the "sum" function in the compute API. This function follows the "have a series of different actions for each different type of data you might encounter" advice above and will allow you to sum up any numeric column.

Upvotes: 1

0x26res

Reputation: 13902

As far as I know, what you are trying to do isn't easy. You'd have to:

  • iterate through each row
  • iterate through each column
  • figure out the type of the column
  • cast the arrow::Array of the column to the underlying type (e.g. arrow::StringArray)
  • get the value for that column, convert it to string and append it to your output

This is further complicated by:

  • the fact that each column is stored in chunks (so iterating through rows isn't as simple)
  • the need to deal with list and struct types as well.
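The chunking part can be sketched without Arrow. This is only an illustration under simplifying assumptions: `ChunkedColumn`, `value_at` and `row_view` are hypothetical names, and every value is assumed to have been converted to a string already, so the type dispatch and the list/struct handling are left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a chunked column whose values have already been
// converted to strings; this is not an Arrow type.
using ChunkedColumn = std::vector<std::vector<std::string>>;

// Resolve a table-wide row index by walking the chunks in order.
inline const std::string& value_at(const ChunkedColumn& column, std::size_t row) {
    for (const auto& chunk : column) {
        if (row < chunk.size()) return chunk[row];
        row -= chunk.size();  // the row lives in a later chunk
    }
    throw std::out_of_range("row index past the end of the column");
}

// Build one row by reading the same row index from every column.
// Columns may be chunked differently, so each one is resolved on its own.
inline std::vector<std::string> row_view(const std::vector<ChunkedColumn>& columns,
                                         std::size_t row) {
    std::vector<std::string> out;
    out.reserve(columns.size());
    for (const auto& column : columns) out.push_back(value_at(column, row));
    return out;
}
```

Walking the chunks from the start on every lookup is fine for a handful of rows. If you iterate over the whole table, it is cheaper to keep a per-column cursor (chunk index plus offset) and advance it row by row.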

It's not impossible, but it is a lot of code (though you'd only have to write it once).

Another option is to write that table to CSV in memory and print it:

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <iostream>

// Serialize the table to CSV in an in-memory buffer, then print it.
arrow::Status dumpTable(const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(auto output, arrow::io::BufferOutputStream::Create());
  ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(*table, arrow::csv::WriteOptions::Defaults(), output.get()));
  // Finish() closes the stream and returns the bytes written so far.
  ARROW_ASSIGN_OR_RAISE(auto buffer, output->Finish());
  std::cout << buffer->ToString();
  return arrow::Status::OK();
}

Upvotes: 0
