Reputation: 7164
I am looping over the elements of an Arrow Array and trying to apply a compute function to each scalar that will tell me the year, month, day, etc. of each element. The code looks something like this:
arrow::NumericArray<arrow::Date32Type> array = {...}
for (int64_t i = 0; i < array.length(); i++) {
arrow::Result<std::shared_ptr<arrow::Scalar>> result = array->GetScalar(i);
if (!result.ok()) {
// TODO: handle error
}
arrow::Result<arrow::Datum> year = arrow::compute::Year(*result);
}
However, I am not really clear on how to extract the actual int64_t value from the arrow::compute::Year call. I have tried things like
const std::shared_ptr<int64_t> val = year.ValueOrDie();
>>> 'arrow::Datum' to non-scalar type 'const std::shared_ptr<long int>' requested
I've similarly tried assigning to a plain int64_t, which also fails with error: cannot convert 'arrow::Datum' to 'int64_t'
I didn't see any method of the Datum class that would otherwise return a scalar value in the primitive type that I think arrow::compute::Year should be returning. Any idea what I might be misunderstanding with the Datum / Scalar / Compute APIs?
Upvotes: 1
Views: 1649
Reputation: 43817
Arrow's compute functions are really meant to be applied to arrays and not scalars; otherwise the per-call overhead makes the operation rather inefficient. The arrow::compute::Year function takes in a Datum. This is a convenience type that could hold a Scalar, an Array, ArrayData, a ChunkedArray, a RecordBatch, or a Table. Not all functions accept all possible values of Datum (in particular, many do not accept RecordBatch or Table).
Once you have a result, there are a few ways you can get the data. Grabbing individual scalars is probably going to be the least efficient, especially if you know the type of the data ahead of time (in this case we know the type will be int64_t). This is because a scalar is meant to be a type-erased wrapper (e.g. like an "object" in Python or Java) around some value, and it carries some overhead.
So my suggestion would be:
// If you are going to be passing your array through the compute
// infrastructure you'll need to have it in a shared_ptr.
// Also, NumericArray is a base class so you don't often need
// to refer to it directly. You'll typically be getting one of the
// concrete subclasses like Date32Array
std::shared_ptr<arrow::Date32Array> array = {...}
// A datum can be implicitly constructed from a shared_ptr to an
// array. You could also explicitly construct it if that is more
// comfortable to you. Here `array` is being implicitly cast to a Datum.
ARROW_ASSIGN_OR_RAISE(arrow::Datum year_datum, arrow::compute::Year(array));
// Now we have a datum, but the docs tell us the return value from the
// `Year` function is always an array, so let's just unwrap it. This is
// something that could probably be improved in Arrow (might as well
// return an array)
std::shared_ptr<arrow::Array> years_arr = year_datum.make_array();
// Also, we know that the data type is Int64 so let's go ahead and
// cast further
std::shared_ptr<arrow::Int64Array> years = std::dynamic_pointer_cast<arrow::Int64Array>(years_arr);
// The concrete classes can be iterated in a variety of ways. GetScalar
// is the least efficient (but doesn't require knowing the type up front).
// Since we know the type (we've cast to Int64Array) we can use Value
// to get a single int64_t, raw_values() to get a const int64_t* (e.g. a
// C-style array) or, perhaps the simplest, begin() and end() to get
// STL-compliant iterators. Note that `years` is a shared_ptr, so it has
// to be dereferenced, and that the iterators yield an optional-like
// value (empty for null slots) rather than a bare int64_t.
for (auto year : *years) {
  if (year.has_value()) {
    std::cout << "Year: " << *year << std::endl;
  }
}
If you really want to work with scalars:
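// arrow::Array is abstract, so hold it through a shared_ptr.
std::shared_ptr<arrow::Array> array = {...}
for (int64_t i = 0; i < array->length(); i++) {
  arrow::Result<std::shared_ptr<arrow::Scalar>> result = array->GetScalar(i);
  if (!result.ok()) {
    // TODO: handle error
  }
  // *result is a shared_ptr<Scalar>, which converts implicitly to a Datum.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum year_datum, arrow::compute::Year(*result));
  // A scalar input gives a scalar output, so unwrap with scalar().
  std::shared_ptr<arrow::Scalar> year_scalar = year_datum.scalar();
  std::shared_ptr<arrow::Int64Scalar> year_scalar_int = std::dynamic_pointer_cast<arrow::Int64Scalar>(year_scalar);
  // For a null input, year_scalar_int->is_valid is false and value is meaningless.
  int64_t year = year_scalar_int->value;
}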
Upvotes: 1