J. Adair

Reputation: 131

How to write data in Avro with the C++ interface when the field is nullable?

First, I searched for this question. I found an answer for the C interface and one for Java, but none for C++. Unfortunately, the methods invoked in the C example don't exist in the C++ API, so one can't simply mimic the answer provided in that particular Stack Overflow discussion.

I am attempting something that should be rather simple, yet after an hour or two I have only managed to get closer to an answer without actually finding one. In the interest of simplicity, I reduced the record that I am attempting to write to a single field: a string that can be null. In Avro this means that the field is optional, which is accomplished through an Avro union, where the convention is that the null type comes first in the schema for that field.

What I've learned thus far from a considerable amount of trial and error:

  1. You need an encoder and decoder within a templated codec_traits struct for the record you want to write. This is typically defined in a header somewhere.
  2. If loading the schema from a file, which I am doing, then you need that schema defined in JSON format in a separate file.
  3. In your C++ code, you declare an avro::DataFileWriter using the schema that you load, along with a record from the aforementioned header. You then have a local record that you populate with your data and then you invoke the write() method.

Should be simple enough, yet not so much. For the particulars, per the list above, the following is the code that I am currently using:

  1. The header:
    #ifndef RECURSIVE_HH
    #define RECURSIVE_HH
    
    #include <string>   // for std::string used in the struct below
    
    #include "Specific.hh"
    #include "Encoder.hh"
    #include "Decoder.hh"
    
    namespace recursive_record
    {
       struct recursive_data
       {
          std::string   fstring;
    
       };
    }
    
    namespace avro
    {
       template<> struct codec_traits<recursive_record::recursive_data>
       {
          static void encode( Encoder& e, const recursive_record::recursive_data& v )
          {
             avro::encode( e, v.fstring );
    
          }
    
          static void decode( Decoder& d, recursive_record::recursive_data& v )
          {
             avro::decode( d, v.fstring );
    
          }
       };
    }
    
    #endif /* RECURSIVE_HH */
  2. The JSON schema file:
    {
        "type": "record",
        "name": "Root",
        "fields": [
            {
                "name": "fstring",
                "type": [
                    "null",
                    "string"
                ]
            }
        ]
    }
  3. The main C++ file (note that I have snipped the file for brevity, so some of the headers included by the full file don't appear to be used in the code shown):
    #include <fstream>   // for std::ifstream in loadSchema()
    
    #include "recursive.h"
    #include "Encoder.hh"
    #include "Decoder.hh"
    #include "Generic.hh"
    #include "GenericDatum.hh"
    #include "ValidSchema.hh"
    #include "DataFile.hh"
    #include "Types.hh"
    #include "Compiler.hh"
    #include "Stream.hh"
    
    avro::ValidSchema loadSchema(const char* filename)
    {
        std::ifstream ifs(filename);
        avro::ValidSchema result;
        avro::compileJsonSchema(ifs, result);
        return result;
    }
    
    
    int main( int argc, char** argv )
    {
       /**********************************************************************************
                                  AVRO WRITER EXAMPLE
       **********************************************************************************/
       try
       {
          //Filename definitions skipped for brevity
    
          avro::ValidSchema          recursiveSchema = loadSchema( schemaFilename );
          avro::DataFileWriter<recursive_record::recursive_data>   dfw( filename, recursiveSchema );
          recursive_record::recursive_data       record;
          record.fstring = std::string("First string");
    
          dfw.write( record );
          dfw.close();
    
       }
       catch( const std::exception& e )
       {
          // Log a message
          return -1;
    
       }
    }

"So what's the problem?" you might ask. Well, the file is actually written successfully, at least in the sense that the code doesn't crash and an Avro data file is produced. So far, so good. However, if you attempt to read that file, you receive the following error:

    AVRO read error: vector::_M_range_check: __n (which is 12) >= this->size() (which is 2)

Wha-??? Yeah. 'Been working on this all afternoon.

After considerable experimentation, I discovered that the problem was due to the nullable aspect of the field. I also noticed that if I remove the nullable option from the schema, so that the schema becomes this:

    {
        "type": "record",
        "name": "Root",
        "fields": [
            {
                "name": "fstring",
                "type": "string"
            }
        ]
    }

and change nothing else, then the new Avro data file is not only written successfully, but is read successfully too:

    [rh6lgn01][1881] MY_EXAMPLES/generate_recursive$ recursive
    schema=recursive.json
    file=./DATA/recursive.avro
    recursiveSchema valid = true
    ReadFile(): Type = record
    ProcessRecord(): New record found.  Field count = 1
    ProcessRecord(): {
    ProcessRecord():   Field 0: type = string
    ProcessDatum():   Field 0: value = First string (length= 12)
    ProcessRecord(): }
    rowCount = 1
    
    AVRO Writing and Reading Complete
    [rh6lgn01][1882] MY_EXAMPLES/generate_recursive$

I had some hope when I read the Java question. One answer noted that, in Java, there is a @Nullable annotation that you can attach to a field in a record. Here is a link to that question: Storing null values in avro files

There is of course no such mechanism in C++. I did find in the Types.hh header the following line of code that seemed related:

    /// define a type to identify Null in template functions
    struct AVRO_DECL Null { };

However, I couldn't make heads or tails of how to use it in a similar fashion. So either I'm missing something or it has a different purpose. I fear the former but suspect the latter.

And this is a link to the stackoverflow C issue, along with its answer, for completion: Write nullable item to avro record in Avro C

I am using version 1.9.2 of the Avro C++ library, running on a GNU/Linux box (not that it should matter, but for completion).

I will continue to prod and seek an answer, but if anyone has done this previously and can shed some light, I would appreciate the feedback.

Thanks!

Upvotes: 4

Views: 2921

Answers (1)

J. Adair

Reputation: 131

Alright, after toying with this until the wee hours of the morning and all day today, I finally figured it out. So I thought I'd post an answer to my own question, in case someone else is searching for the same information. I'll try to be brief, but if you aren't into detail, I'd suggest you stop reading now.

In the end I discovered that there are two approaches one can take to resolve this issue. Both yield the same result: the ability to write data into a field/column in an Avro data file where that field has been declared as optional in the schema, that is, it has the "null union" attached to its type. I will begin with the approach closest to the one expressed in my original question, then provide an alternative solution, and conclude with an observation or two. Note that in both approaches the JSON schema remains unchanged from my initial post; only the header and the code body changed. See my initial post for the schema content.

So, the first approach. As with my first attempt, it involves a custom encoder and decoder (as shown in the header file in my original post), a JSON schema (mine was in a separate file), and then the primary body of code. To keep things short, the problem lay in the header, as I suspected. To fix it, you need to avoid writing that header yourself for anything beyond the most rudimentary scenarios, such as those shown in the examples that come with the Avro C++ distribution. Rather, you should let the Avro tool named "avrogencpp" do the heavy lifting of creating the custom encoder/decoder. I recommend this simply because the code that avrogencpp produces in that header is convoluted, to say the least. Once you read and understand it, the content makes sense, but for a record with more than a few fields the length becomes rather unwieldy for a human. Thus, let machines do what they do best. Anyway, this was the command I used:

    avrogencpp -i recursive.json -o recursive.h -n recursive_namespace

The result was a header that, nestled in its innards, had a struct definition named "Root", which matched the name of my top-level, or outermost, record as defined in the unchanged schema (no coincidence). With that, I could write the following in the main body of code:

      avro::ValidSchema          recursiveSchema = loadSchema( schemaFilename );
      avro::DataFileWriter<recursive_namespace::Root>   dfw( filename, recursiveSchema );
      recursive_namespace::Root  record;
      // snipped for brevity
      record.fstring.set_string( "String set via direct record value assignment" );
      dfw.write( record );
      dfw.close();

This would be successful, as seen in the output:

    [rh6lgn01][2174] MY_EXAMPLES/generate_recursive$ recursive
    schema=recursive.json
    file=./DATA/recursive.avro
    recursiveSchema valid = 1
    ReadFile(): Enter
    ReadFile(): Type = record
    ProcessRecord(): New record found.  Field count = 1
    ProcessRecord(): {
    ProcessRecord():   Field 0: type = string
    ProcessDatum():   Field 0: value = String set via direct record value assignment (length = 45)
    ProcessRecord(): }
    rowCount = 1
    -----------------------
    
    AVRO Writing and Reading Complete
    [rh6lgn01][2175] MY_EXAMPLES/generate_recursive$

And so that's that. Now to the second approach. This uses the GenericDatum class and is similar to the problem and answer shown in this stackoverflow discussion:

How to read data from AVRO file using C++ interface?

In a way, one could argue that this approach has the benefit that you don't need a custom encoder/decoder, and thus don't need the avrogencpp tool either. While that is true, I must admit to wondering about the performance of using the generic "interface" in Avro; it just seems like it might be a tad slower than the direct route. On the other hand, it can read any file and is thus more flexible. But I digress; back to the solution. The only code you need is in the main body. Granted, what I am about to present is snipped to the bare essentials in order to demonstrate the approach, so in real life you would need to flesh it out to handle other types, etc. However, it will convey the idea, which is all you need. And this is it:

      avro::DataFileWriter<avro::GenericDatum>   writer( filename, schema );
      avro::GenericDatum    datum( schema );

      if( avro::AVRO_RECORD == datum.type() )
      {
         avro::GenericRecord  &record = datum.value<avro::GenericRecord>();
         for( uint32_t i = 0; i < record.fieldCount(); i++ )
         {
            avro::GenericDatum &fieldDatum = record.fieldAt( i );

            // If the datum is a union, then it's likely that the field
            // is optional.  We'd need to flesh this out considerably to
            // ensure that this was indeed the case, but for brevity
            // reasons, this will work:
            if( fieldDatum.isUnion() )
            {
               // Assuming the well-known Avro convention of the null
               // being first in the optional "syntax", merely jump to
               // the second branch, which has the "actual type" that
               // the field/column is supposed to represent.  Again,
               // this is in dire need of fleshing-out...
               fieldDatum.selectBranch( 1 );
               switch( fieldDatum.type() )
               {
                  case avro::AVRO_STRING:
                  {
                     std::string &newValue = fieldDatum.value<std::string>();
                     newValue = "New string set via switching branches in the union";
                     break;
                  }
                  default:
                     break;
               }
            }
         }

         // Write the whole datum once, after all of its fields are populated.
         writer.write( datum );
      }
      writer.close();

This variant produces the following:

    [rh6lgn01][2177] MY_EXAMPLES/generate_recursive$ recursive
    schema=recursive.json
    file=./DATA/recursive.avro
    Top level was a record
    The record had 1 fields.
    Field datum was a union = true
    Field datum 0 was a union.  Current branch = 0
    Field datum 0 is now a string.  Current branch = 1
    ReadFile(): Enter
    ReadFile(): Type = record
    ProcessRecord(): New record found.  Field count = 1
    ProcessRecord(): {
    ProcessRecord():   Field 0: type = string
    ProcessDatum():   Field 0: value = New string set via switching branches in the union (length = 50)
    ProcessRecord(): }
    rowCount = 1
    -----------------------
    
    AVRO Writing and Reading Complete
    [rh6lgn01][2178] MY_EXAMPLES/generate_recursive$

And so it is a satisfactory solution as well.

For me, I'll likely go with the latter approach, as it just somehow seems "cleaner." That said, I think the more correct reason is that I already use the generic "interface" to read Avro files, so using it again for writing seems more congruent. In addition, I prefer the second approach because it doesn't require avrogencpp. YMMV.

I hope this answer helps someone in the future. Best of luck!

Jerry

Upvotes: 6
