gbeaven

Reputation: 1790

Avro schema not respecting alias in schema definition

Avro schema schema.avsc:

{
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "aliases":["agents"],
    "fields": [
        {
            "name": "id",
            "type": ["string", "null"]
        },
        {
            "name": "name",
            "type": ["string", "null"],
            "aliases":["title", "nickname"]
        }
    ]
}

Python script main.py:

from fastavro import writer, reader
from fastavro.schema import load_schema
import jsonlines

schema = load_schema('schema.avsc')
avro_data = 'agent.avro'
data = jsonlines.open('data.jsonl')

with open(avro_data, 'wb') as fout:
    writer(fout, schema, data, validator=True)

with open(avro_data, 'rb') as fin:
    for i in reader(fin, schema):
        print(i)

When my json lines data.jsonl file looks like this:

{"id":"1","name":"foo"}
{"id":"2","name":"bar"}

My python script returns:

{'id': '1', 'name': 'foo'}
{'id': '2', 'name': 'bar'}

However, if my json lines data.jsonl file looks like this:

{"id":"1","title":"foo"}
{"id":"2","title":"bar"}

My python script returns:

{'id': '1', 'name': None}
{'id': '2', 'name': None}

Any idea why the name column isn't respecting the aliases attribute I've defined in the avro schema file for that particular field?

Upvotes: 1

Views: 663

Answers (1)

Scott

Reputation: 2074

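Aliases only come into play when data written with an old (writer) schema is read with a new (reader) schema. Your example writes and reads with the same schema, so there is no old field name for the alias to match. At write time the title key in your records is not a field of the schema, so it is presumably ignored and name is written as null, which is what you then read back.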
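
To make the read-time matching concrete, here is a minimal pure-Python sketch of how a reader schema's field aliases might be matched against the writer schema's field names. It is an illustration only, not fastavro's implementation: `resolve_record` and its arguments are hypothetical names, and real Avro resolution also handles defaults, type promotion and record-level aliases.

```python
def resolve_record(written, writer_fields, reader_fields):
    """Map a record stored under the writer's field names onto the reader's fields.

    Simplified: a reader field with no match becomes None here, whereas real
    Avro would use the field's default or raise an error.
    """
    resolved = {}
    for field in reader_fields:
        # A reader field matches a writer field by its own name or by any alias.
        candidates = [field["name"], *field.get("aliases", [])]
        match = next((c for c in candidates if c in writer_fields), None)
        resolved[field["name"]] = written.get(match) if match else None
    return resolved


reader_fields = [
    {"name": "id"},
    {"name": "name", "aliases": ["title", "nickname"]},
]

# Written with an old schema that has a "title" field: the alias applies.
print(resolve_record({"id": "1", "title": "foo"}, ["id", "title"], reader_fields))

# Written with the same schema as the reader: the file only ever contained
# "name" (stored as null), so there is nothing for the alias to match.
print(resolve_record({"id": "1", "name": None}, ["id", "name"], reader_fields))
```

The first call yields a record with name set to 'foo'; the second yields name as None, mirroring the two outcomes in the question.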

Let's use the following two schemas in an example. Here's an "old" schema which uses the title field:

old_schema.avsc

{
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "aliases":["agents"],
    "fields": [
        {
            "name": "id",
            "type": ["string", "null"]
        },
        {
            "name": "title",
            "type": ["string", "null"]
        }
    ]
}

And here's a "new" schema in which the name field declares the old title field as an alias:

new_schema.avsc

{
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "aliases":["agents"],
    "fields": [
        {
            "name": "id",
            "type": ["string", "null"]
        },
        {
            "name": "name",
            "type": ["string", "null"],
            "aliases":["title"]
        }
    ]
}

If we use your second data.jsonl which looks like this:

{"id":"1","title":"foo"}
{"id":"2","title":"bar"}

Then we can use a slightly modified version of your main.py so that the data is written with the old schema and then the new schema is passed to the reader so that the aliases are respected:

from fastavro import writer, reader
from fastavro.schema import load_schema
import jsonlines

old_schema = load_schema('old_schema.avsc')
new_schema = load_schema('new_schema.avsc')
avro_data = 'agent.avro'
data = jsonlines.open('data.jsonl')

# Data is written with the old schema
with open(avro_data, 'wb') as fout:
    writer(fout, old_schema, data, validator=True)

# And read with new schema
with open(avro_data, 'rb') as fin:
    for i in reader(fin, new_schema):
        print(i)

Now the output is correct:

{'id': '1', 'name': 'foo'}
{'id': '2', 'name': 'bar'}
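
If you would rather keep a single schema and your input is plain JSON rather than existing Avro files, another option is to rename the keys yourself before writing. The sketch below is one possible approach, not something fastavro does for you: `alias_map` and `normalise` are hypothetical helpers, and the schema is parsed from inline JSON text purely to keep the example self-contained.

```python
import json

SCHEMA_TEXT = """
{
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "fields": [
        {"name": "id", "type": ["string", "null"]},
        {"name": "name", "type": ["string", "null"], "aliases": ["title", "nickname"]}
    ]
}
"""


def alias_map(schema):
    """Build {alias: canonical_field_name} from a record schema's fields."""
    return {
        alias: field["name"]
        for field in schema["fields"]
        for alias in field.get("aliases", [])
    }


def normalise(record, aliases):
    """Rename aliased keys to the canonical field name; leave other keys as-is."""
    return {aliases.get(key, key): value for key, value in record.items()}


schema = json.loads(SCHEMA_TEXT)
aliases = alias_map(schema)

# Stand-in for the lines of data.jsonl
lines = ['{"id":"1","title":"foo"}', '{"id":"2","title":"bar"}']
rows = [normalise(json.loads(line), aliases) for line in lines]
print(rows)
```

The resulting rows use the name key, so they could be passed to the writer in place of the raw data. Note that this sketch does not handle a record that contains both name and one of its aliases.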

Upvotes: 2
