Karan Kumar

Reputation: 43

How to read ORC file column data

I have downloaded the ORC C++ API and built it on Ubuntu. Now I am trying to read column data in batches. The reference mentions that an orc::ColumnVectorBatch can be dynamic_cast to the batch type for a specific column data type, such as orc::Decimal64VectorBatch. But in my code the dynamic_cast returns a null pointer. Below is my code:

// Orc Reader.

#include <memory>
#include <iostream>
#include <vector>
#include <list>
#include <fstream>

#include <orc/Reader.hh>
#include <orc/ColumnPrinter.hh>
#include <orc/Exceptions.hh>
#include <orc/OrcFile.hh>

int main(int argc, char const *argv[])
{
    std::list<uint64_t> read_cols = {4};
    std::string file_path = "~/trades_data.zlib.orc";

    std::ifstream in_file(file_path.c_str(), std::ios::binary);
    in_file.seekg(0, std::ios::end);
    int file_size = in_file.tellg();
    std::cout << "Size of the file is" << " " << file_size << " " << "bytes";

    orc::RowReaderOptions row_reader_opts;
    row_reader_opts.include(read_cols);

    orc::ReaderOptions reader_opts;
    std::unique_ptr<orc::Reader> reader;
    std::unique_ptr<orc::RowReader> row_reader;

    reader = orc::createReader(orc::readFile(file_path), reader_opts);
    row_reader = reader->createRowReader(row_reader_opts);

    std::unique_ptr<orc::ColumnVectorBatch> batch = row_reader->createRowBatch(1000);

    while (row_reader->next(*batch))
    {
        // BELOW LINE OF CODE IS GIVING NULLPOINTER.
        orc::Decimal64VectorBatch *dec_vec = dynamic_cast<orc::Decimal64VectorBatch*>(batch.get());
    }

    return 0;
}

It would be a big help if someone could point out the error.

Upvotes: 1

Views: 1215

Answers (2)

ideal

Reputation: 151

This is how I do it; I hope it helps.

Full code demo: https://github.com/harbby/cmake_ExternalProject_demo

//double field
auto *fields = dynamic_cast<orc::StructVectorBatch *>(batch.get());
auto *col0 = dynamic_cast<orc::DoubleVectorBatch *>(fields->fields[0]);
double *buffer1 = col0->data.data();

//string field
auto *col4 = dynamic_cast<orc::StringVectorBatch *>(fields->fields[4]);
char **buffer2 = col4->data.data();
int64_t *lengths = col4->length.data();  // lengths are stored as int64_t; long only matches on some platforms

while (row_reader->next(*batch)) {
    for (uint32_t r = 0; r < batch->numElements; ++r) {
        std::cout << "line " << buffer1[r] << "," << std::string(buffer2[r], lengths[r]) << "\n";
    }
    //std::cout << "this batch nums" << " " << batch->numElements << " " << "lines\n";
}
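
The loop above assumes every row has a value. The ORC batch classes also carry a hasNulls flag and a notNull array, and the strings are stored as a pointer plus a length, not as NUL-terminated text. Below is a minimal sketch of reading such a column with a null check. StringColumn and rows_to_strings are hypothetical stand-ins written for this illustration so that it compiles without the ORC headers; they only mimic the data, length, notNull and hasNulls members and are not the real orc::StringVectorBatch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in that mimics the layout of a string column batch.
// It is NOT the real orc::StringVectorBatch.
struct StringColumn {
    std::vector<const char*> data;  // start of each value inside a shared blob
    std::vector<int64_t> length;    // byte length of each value
    std::vector<char> notNull;      // 1 = value present, 0 = null
    bool hasNulls = false;          // true if notNull must be consulted
};

// Copy the first num_elements rows into std::string objects.
// Null rows are replaced by null_text (an assumption made for this sketch).
std::vector<std::string> rows_to_strings(const StringColumn& col,
                                         std::size_t num_elements,
                                         const std::string& null_text = "NULL") {
    std::vector<std::string> out;
    out.reserve(num_elements);
    for (std::size_t r = 0; r < num_elements; ++r) {
        if (col.hasNulls && col.notNull[r] == 0) {
            out.push_back(null_text);
            continue;
        }
        // Values are not NUL-terminated, so the length must be passed explicitly.
        out.emplace_back(col.data[r], static_cast<std::size_t>(col.length[r]));
    }
    return out;
}
```

With the real library the same check would presumably be applied to col4->hasNulls and col4->notNull inside the while loop, using batch->numElements as the row count.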

Upvotes: 0

Karan Kumar

Reputation: 43

I resolved this problem a while ago and am now writing an answer to my own question; I hope it helps with your code as well. The code in the question tries to cast the batch read from row_reader directly to an orc::Decimal64VectorBatch. That cast returns a null pointer because the top-level batch is an orc::StructVectorBatch that holds one child batch per selected column. The batch must first be cast to orc::StructVectorBatch; the child at the column's index can then be cast to the required column batch type.

    const int time_idx = 0; // Index of the column containing time in decimal64 format.
    while (row_reader->next(*batch))
    {
        // The top-level batch must first be cast to StructVectorBatch.
        const auto &struct_batch = dynamic_cast<const orc::StructVectorBatch&>(*batch);
        // The child batch at time_idx can then be cast to the required column type.
        // A reference dynamic_cast throws std::bad_cast if the type does not match.
        const auto &dec_batch = dynamic_cast<const orc::Decimal64VectorBatch&>(*struct_batch.fields[time_idx]);
        // Raw decimal values; the first batch->numElements entries are valid.
        const int64_t *dec_vec = dec_batch.values.data();
    }
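
To show why the direct cast yields a null pointer, here is a small self-contained sketch of the same two-step cast. ColumnBatch, StructBatch, Decimal64Batch, make_root_batch and decimal_column are hypothetical stand-ins invented for this illustration so that it compiles without the ORC headers; they only mimic the parent/child shape of the ORC batch classes and are not the library's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins that mimic the shape of the ORC batch classes.
// They are NOT the real orc:: types.
struct ColumnBatch {
    virtual ~ColumnBatch() = default;  // polymorphic base so dynamic_cast works
};

struct Decimal64Batch : ColumnBatch {
    std::vector<int64_t> values;
};

struct StructBatch : ColumnBatch {
    std::vector<ColumnBatch*> fields;                  // non-owning child pointers
    std::vector<std::unique_ptr<ColumnBatch>> owned;   // keeps the children alive
};

// Build a root batch that wraps a single decimal column, the way a row
// reader hands back a struct even when only one column is selected.
std::unique_ptr<ColumnBatch> make_root_batch() {
    auto root = std::make_unique<StructBatch>();
    auto dec = std::make_unique<Decimal64Batch>();
    dec->values = {100, 200, 300};
    root->fields.push_back(dec.get());
    root->owned.push_back(std::move(dec));
    return std::unique_ptr<ColumnBatch>(std::move(root));
}

// Two-step cast: root -> struct, then child -> decimal column.
// Returns nullptr if the root is not a struct, the index is out of range,
// or the child is not a decimal column.
const Decimal64Batch* decimal_column(const ColumnBatch& root, std::size_t index) {
    const auto* as_struct = dynamic_cast<const StructBatch*>(&root);
    if (as_struct == nullptr || index >= as_struct->fields.size()) {
        return nullptr;
    }
    return dynamic_cast<const Decimal64Batch*>(as_struct->fields[index]);
}
```

Casting the root straight to Decimal64Batch* gives nullptr here, because the root object really is a StructBatch, while decimal_column(*root, 0) returns the child column. The pointer form of dynamic_cast is used so that a mismatch can be checked instead of throwing.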

Upvotes: 1
