Reputation: 131
First, I searched for this question. I found an answer for the C interface and one for Java, but not for C++. Unfortunately, the methods invoked in the C example don't exist in the C++ API, so one can't merely mimic the answer provided in that particular Stack Overflow question.
I am attempting something that should be rather simple, yet after an hour or two I have only managed to get closer to an answer without actually finding one. In the interest of simplicity, I reduced the record that I am attempting to write to a single field. That field is a string that can be null; in Avro terms, the field is optional. The nullability is expressed through an Avro union, where the convention is that the null value comes first in the schema for that field.
What I've learned thus far from a considerable amount of trial and error:
Should be simple enough. Yet not so much. Given the particulars in the list above, the following is the code that I am currently using:
#ifndef RECURSIVE_HH
#define RECURSIVE_HH

#include <string>

#include "Specific.hh"
#include "Encoder.hh"
#include "Decoder.hh"

namespace recursive_record
{
    struct recursive_data
    {
        std::string fstring;
    };
}

namespace avro
{
    template<> struct codec_traits<recursive_record::recursive_data>
    {
        static void encode( Encoder& e, const recursive_record::recursive_data& v )
        {
            avro::encode( e, v.fstring );
        }
        static void decode( Decoder& d, recursive_record::recursive_data& v )
        {
            avro::decode( d, v.fstring );
        }
    };
}

#endif /* RECURSIVE_HH */
{
    "type": "record",
    "name": "Root",
    "fields": [
        {
            "name": "fstring",
            "type": [
                "null",
                "string"
            ]
        }
    ]
}
#include "recursive.h"
#include "Encoder.hh"
#include "Decoder.hh"
#include "Generic.hh"
#include "GenericDatum.hh"
#include "ValidSchema.hh"
#include "DataFile.hh"
#include "Types.hh"
#include "Compiler.hh"
#include "Stream.hh"
avro::ValidSchema loadSchema(const char* filename)
{
    std::ifstream ifs(filename);
    avro::ValidSchema result;
    avro::compileJsonSchema(ifs, result);
    return result;
}
int main( int argc, char** argv )
{
    /**********************************************************************************
        AVRO WRITER EXAMPLE
    **********************************************************************************/
    try
    {
        // Filename definitions skipped for brevity
        avro::ValidSchema recursiveSchema = loadSchema( schemaFilename );
        avro::DataFileWriter<recursive_record::recursive_data> dfw( filename, recursiveSchema );

        recursive_record::recursive_data record;
        record.fstring = std::string("First string");

        dfw.write( record );
        dfw.close();
    }
    catch( const std::exception& e )
    {
        // Log a message
        return -1;
    }
}
"So what's the problem?" you might ask. Well, the file is actually written successfully, at least in that the code doesn't crash and an Avro data file is produced. So far, so good. However, if you attempt to read that file, then you receive the following error:
AVRO read error: vector::_M_range_check: __n (which is 12) >= this->size() (which is 2)
Wha-??? Yeah. 'Been working on this all afternoon.
After considerable experimentation, I discovered that the problem was due to the nullable aspect of the field. I also noticed that if I remove the nullable option from the schema, so that the schema becomes this:
{
"type": "record",
"name": "Root",
"fields": [
{
"name": "fstring",
"type": "string"
}
]
}
and change nothing else, then the new Avro data file is not only written successfully, but read successfully too:
[rh6lgn01][1881] MY_EXAMPLES/generate_recursive$ recursive
schema=recursive.json
file=./DATA/recursive.avro
recursiveSchema valid = true
ReadFile(): Type = record
ProcessRecord(): New record found. Field count = 1
ProcessRecord(): {
ProcessRecord(): Field 0: type = string
ProcessDatum(): Field 0: value = First string (length= 12)
ProcessRecord(): }
rowCount = 1
AVRO Writing and Reading Complete
[rh6lgn01][1882] MY_EXAMPLES/generate_recursive$
I had some hope when I read the Java question. One answer noted that, in Java, there is a @Nullable annotation that you can associate with a field in a record. Here is a link to that question: Storing null values in avro files
There is of course no such mechanism in C++. I did find in the Types.hh header the following declaration that somehow seemed related:
/// define a type to identify Null in template functions
struct AVRO_DECL Null { };
However, I couldn't make heads or tails of how to use it in a similar fashion. So I'm either missing something or it has a different purpose. I fear the former but suspect the latter.
And this is a link to the Stack Overflow C question, along with its answer, for completeness: Write nullable item to avro record in Avro C
I am using version 1.9.2 of the Avro C++ library, running on a GNU/Linux box (not that it should matter, but for completeness).
I will continue to prod and seek an answer, but if anyone has done this previously and can shed some light, I would appreciate the feedback.
Thanks!
Upvotes: 4
Views: 2921
Reputation: 131
Alright, after toying with this until the wee hours of the morning and all day today, I finally figured it out. So I thought I'd post an answer to my own question, in case someone else is searching for the same information. I'll try to be brief, but if you aren't into detail I'd suggest that you stop reading now.
In the end I discovered that there are two approaches one can take to resolve this issue. Both yield the same result: the ability to write data into a field/column of an Avro data file where that field has been declared as optional in the schema, that is, where it has the "null union" attached to its type. I will begin with the approach most closely related to the one in my original question, then provide an alternative solution, and conclude with an observation or two. Note that in both approaches the JSON schema remains unchanged from my initial post; only the header and the code body changed. See my initial post for the schema.
So, the first approach. As with my first attempt, this approach involves a custom encoder and decoder (as shown in the header file in my original post), the JSON schema (mine was in a separate file), and then the primary body of code. To keep things short, the problem lay in the header, as I suspected. To fix it, avoid writing that header yourself for anything beyond the most rudimentary scenarios, such as those shown in the examples that ship with the Avro C++ distribution. Instead, let the Avro tool named "avrogencpp" do the heavy lifting of creating the custom encoder/decoder. I recommend this simply because the code that avrogencpp produces in that header is convoluted, to say the least. Once you read and understand it, the content makes sense, but for a record with more than a few fields it becomes rather unwieldy for a human to write. Let machines do what they do best. Anyway, this is the command I used:
avrogencpp -i recursive.json -o recursive.h -n recursive_namespace
The result was a header that, nestled in its innards, had a struct definition named "Root", which matched the name of my top-level, or outermost, record as defined in the unchanged schema (no coincidence). And so with that, I could write the following in the main body of code:
avro::ValidSchema recursiveSchema = loadSchema( schemaFilename );
avro::DataFileWriter<recursive_namespace::Root> dfw( filename, recursiveSchema );
recursive_namespace::Root record;
// snipped for brevity
record.fstring.set_string( "String set via direct record value assignment" );
dfw.write( record );
dfw.close();
This succeeded, as seen in the output:
[rh6lgn01][2174] MY_EXAMPLES/generate_recursive$ recursive
schema=recursive.json
file=./DATA/recursive.avro
recursiveSchema valid = 1
ReadFile(): Enter
ReadFile(): Type = record
ProcessRecord(): New record found. Field count = 1
ProcessRecord(): {
ProcessRecord(): Field 0: type = string
ProcessDatum(): Field 0: value = String set via direct record value assignment (length = 45)
ProcessRecord(): }
rowCount = 1
-----------------------
AVRO Writing and Reading Complete
[rh6lgn01][2175] MY_EXAMPLES/generate_recursive$
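For the curious: the crux of why my original hand-written header failed is that it never encoded the union branch index. The schema declares fstring as a union of null and string, and Avro's binary format expects the branch index to be written before the value. (That, presumably, is also why the reader complained that 12, the length of "First string", was >= 2, the number of branches.) Stripped to its essence for this one nullable field, what the generated header does amounts to roughly the following. This is an illustrative, hand-simplified sketch, not the verbatim avrogencpp output; the real generated code wraps the union in its own class with names derived from the schema:

// Hand-simplified sketch of the nullable-string handling that avrogencpp
// generates; the real generated names and structure differ.
#include <string>

#include "Specific.hh"
#include "Encoder.hh"
#include "Decoder.hh"

struct Root
{
    bool        fstring_is_null;   // which union branch is active
    std::string fstring;           // value when the string branch is active
};

namespace avro
{
    template<> struct codec_traits<Root>
    {
        static void encode( Encoder& e, const Root& v )
        {
            if( v.fstring_is_null )
            {
                e.encodeUnionIndex( 0 );   // branch 0 = "null"
                e.encodeNull();
            }
            else
            {
                e.encodeUnionIndex( 1 );   // branch 1 = "string"
                avro::encode( e, v.fstring );
            }
        }
        static void decode( Decoder& d, Root& v )
        {
            size_t branch = d.decodeUnionIndex();
            v.fstring_is_null = ( 0 == branch );
            if( v.fstring_is_null )
            {
                d.decodeNull();
            }
            else
            {
                avro::decode( d, v.fstring );
            }
        }
    };
}

The generated wrapper for the union also exposes branch helpers along the lines of is_null()/set_string()/set_null(), which is why record.fstring.set_string(...) works in the code above, and why writing a missing value is just a matter of selecting the null branch.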
And so that's that. Now to the second approach. This one uses the GenericDatum class and is similar to the problem and answer shown in this Stack Overflow question:
How to read data from AVRO file using C++ interface?
One could argue that this approach has the benefit of not needing a custom encoder/decoder, and thus not needing the avrogencpp tool either. While that is true, I must admit to wondering about the performance of the generic "interface" in Avro; it just seems like it might be a tad slower than the direct route. On the other hand, it can read any file and is thus more flexible. I digress. Back to the solution. The only code you need is in the main body. Granted, what I am about to present is snipped to the bare essentials to demonstrate the approach, so in real life you would need to flesh it out to handle other types, etc. It will convey the idea, though, which is all you need. And this is it:
avro::DataFileWriter<avro::GenericDatum> writer( filename, schema );
avro::GenericDatum datum( schema );

if( avro::AVRO_RECORD == datum.type() )
{
    avro::GenericRecord &record = datum.value<avro::GenericRecord>();

    for( uint32_t i = 0; i < record.fieldCount(); i++ )
    {
        avro::GenericDatum &fieldDatum = record.fieldAt( i );

        // If the datum is a union, then it's likely that the datum is an
        // optional field. We'd need to flesh this out considerably to ensure
        // that this was indeed the case, but for brevity this will work:
        if( true == fieldDatum.isUnion() )
        {
            // Assuming the well-known Avro convention of the null coming
            // first in the optional "syntax", merely jump to the second
            // branch, which has the "actual type" that the field/column is
            // supposed to represent. (To write a null instead, you would
            // select branch 0.) Again, this is in dire need of fleshing-out...
            fieldDatum.selectBranch( 1 );

            switch( fieldDatum.type() )
            {
                case avro::AVRO_STRING:
                {
                    std::string &newValue = fieldDatum.value<std::string>();
                    newValue = "New string set via switching branches in the union";
                    break;
                }
                default:
                    break;
            }
        }
    }
}

// Write the single populated datum (one row) and close the file.
writer.write( datum );
writer.close();
This variant produces the following:
[rh6lgn01][2177] MY_EXAMPLES/generate_recursive$ recursive
schema=recursive.json
file=./DATA/recursive.avro
Top level was a record
The record had 1 fields.
Field datum was a union = true
Field datum 0 was a union. Current branch = 0
Field datum 0 is now a string. Current branch = 1
ReadFile(): Enter
ReadFile(): Type = record
ProcessRecord(): New record found. Field count = 1
ProcessRecord(): {
ProcessRecord(): Field 0: type = string
ProcessDatum(): Field 0: value = New string set via switching branches in the union (length = 50)
ProcessRecord(): }
rowCount = 1
-----------------------
AVRO Writing and Reading Complete
[rh6lgn01][2178] MY_EXAMPLES/generate_recursive$
And so it is a satisfactory solution as well.
For me, I'll likely go with the latter approach, as it just somehow seems "cleaner." That said, the more accurate reason is probably that I already use the generic "interface" to read Avro files, so using it for writing as well feels more congruent. I also prefer the second approach because it avoids the need for avrogencpp. YMMV.
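For completeness, reading the file back with the generic interface (the ReadFile/ProcessRecord flavor behind the log output above) boils down to roughly the following. This is a minimal sketch for this one-field schema, wrapped in a hypothetical helper named readBack; my actual reader has far more type handling:

// Minimal sketch of reading back with the generic interface; illustrative only.
#include <iostream>

#include "DataFile.hh"
#include "Generic.hh"
#include "GenericDatum.hh"
#include "ValidSchema.hh"

void readBack( const char* filename, const avro::ValidSchema& schema )
{
    avro::DataFileReader<avro::GenericDatum> reader( filename, schema );
    avro::GenericDatum datum( schema );

    while( reader.read( datum ) )
    {
        if( avro::AVRO_RECORD == datum.type() )
        {
            avro::GenericRecord &record = datum.value<avro::GenericRecord>();
            avro::GenericDatum &field = record.fieldAt( 0 );

            // For a union field, type() reports the currently selected branch,
            // so a populated optional string shows up as AVRO_STRING and a
            // missing one as AVRO_NULL.
            if( avro::AVRO_STRING == field.type() )
            {
                std::cout << "fstring = " << field.value<std::string>() << std::endl;
            }
            else
            {
                std::cout << "fstring = (null)" << std::endl;
            }
        }
    }
}

Because type() resolves to the selected union branch, the reading side needs no special union handling beyond checking for AVRO_NULL.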
I hope this answer helps someone in the future. Best of luck!
Jerry
Upvotes: 6