Reputation: 1790
Avro schema schema.avsc:
{
  "namespace": "standard",
  "type": "record",
  "name": "agent",
  "aliases": ["agents"],
  "fields": [
    {
      "name": "id",
      "type": ["string", "null"]
    },
    {
      "name": "name",
      "type": ["string", "null"],
      "aliases": ["title", "nickname"]
    }
  ]
}
Python script main.py:
import jsonlines
from fastavro import writer, reader
from fastavro.schema import load_schema
schema = load_schema('schema.avsc')
avro_data = 'agent.avro'
data = jsonlines.open('data.jsonl')
# Write the JSON Lines records to an Avro file
with open(avro_data, 'wb') as fout:
    writer(fout, schema, data, validator=True)
# Read the records back using the same schema
with open(avro_data, 'rb') as fin:
    for i in reader(fin, schema):
        print(i)
When my JSON Lines file data.jsonl looks like this:
{"id":"1","name":"foo"}
{"id":"2","name":"bar"}
My Python script returns:
{'id': '1', 'name': 'foo'}
{'id': '2', 'name': 'bar'}
However, if my JSON Lines file data.jsonl looks like this:
{"id":"1","title":"foo"}
{"id":"2","title":"bar"}
My Python script returns:
{'id': '1', 'name': None}
{'id': '2', 'name': None}
Any idea why the name column isn't respecting the aliases attribute I've defined in the Avro schema file for that particular field?
Upvotes: 1
Views: 663
Reputation: 2074
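Aliases are used when you have data written with an old schema that you want to read with a new schema. Your example uses a single schema for both writing and reading, so the aliases never come into play: the writer does not rename the keys of incoming records, and since both fields are nullable, the missing name key is simply written as null.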
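To make the idea concrete, here is a rough pure-Python sketch of what alias resolution on the reader side amounts to. It is only an illustration, not fastavro's actual implementation: the helper resolve_record is a hypothetical name, and falling back to the field default (or None) for an unmatched field is a simplification.

```python
def resolve_record(writer_record, reader_schema):
    """Map a record written under an old schema onto the reader schema's field names.

    Hypothetical helper for illustration only; real Avro readers do this
    while decoding binary data, not on plain dicts.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        # Try the reader's own field name first, then each of its aliases.
        candidates = [field["name"], *field.get("aliases", [])]
        for candidate in candidates:
            if candidate in writer_record:
                resolved[field["name"]] = writer_record[candidate]
                break
        else:
            # Simplification: use the declared default, or None if there is none.
            resolved[field["name"]] = field.get("default")
    return resolved


# Reader ("new") schema: the name field declares title as an alias.
new_schema = {
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "fields": [
        {"name": "id", "type": ["string", "null"]},
        {"name": "name", "type": ["string", "null"], "aliases": ["title"]},
    ],
}

# Records as they would have been written under an old schema using title.
old_records = [{"id": "1", "title": "foo"}, {"id": "2", "title": "bar"}]

resolved = [resolve_record(record, new_schema) for record in old_records]
for record in resolved:
    print(record)
```

The key point is that the lookup runs from the reader's field names and aliases toward names that exist in the written data, which is why a second (writer) schema is needed for aliases to have any effect.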
Let's use the following two schemas in an example. Here's an "old" schema which uses the title field:
old_schema.avsc
{
  "namespace": "standard",
  "type": "record",
  "name": "agent",
  "aliases": ["agents"],
  "fields": [
    {
      "name": "id",
      "type": ["string", "null"]
    },
    {
      "name": "title",
      "type": ["string", "null"]
    }
  ]
}
And here's a new schema in which the name field declares the old title field as an alias:
new_schema.avsc
{
  "namespace": "standard",
  "type": "record",
  "name": "agent",
  "aliases": ["agents"],
  "fields": [
    {
      "name": "id",
      "type": ["string", "null"]
    },
    {
      "name": "name",
      "type": ["string", "null"],
      "aliases": ["title"]
    }
  ]
}
If we use your second data.jsonl, which looks like this:
{"id":"1","title":"foo"}
{"id":"2","title":"bar"}
Then we can use a slightly modified version of your main.py, writing the data with the old schema and passing the new schema to the reader so that the aliases are respected:
from fastavro import writer, reader
from fastavro.schema import load_schema
import jsonlines
old_schema = load_schema('old_schema.avsc')
new_schema = load_schema('new_schema.avsc')
avro_data = 'agent.avro'
data = jsonlines.open('data.jsonl')
# Data is written with the old schema
with open(avro_data, 'wb') as fout:
    writer(fout, old_schema, data, validator=True)
# And read with the new schema
with open(avro_data, 'rb') as fin:
    for i in reader(fin, new_schema):
        print(i)
Now the output is correct:
{'id': '1', 'name': 'foo'}
{'id': '2', 'name': 'bar'}
Upvotes: 2