Misha AM

Reputation: 137

Avro DataFileReader needs a seekable file

The problem is that stdin doesn't support seek, which Avro needs, so we read everything into a buffer and then pass that buffer to avro_wrapper. This works in Python 2 but not in Python 3. I have tried a few solutions, but none of them work.

# stdin doesn't support seek, which avro needs, so we read everything into
# a buffer and pass that to avro_wrapper. This hack worked in Python 2 but
# does not work in Python 3.
buf = StringIO()
buf.write(args.input_file.read())
r = DataFileReader(buf, DatumReader())
# The very first record holds the header information: the header names in
# order, along with the munged header names, for all the record types.
# E.g. if we have 2 ports, it holds the header information of
#   1. port1 under the name1 key
#   2. port2 under the name2 key, and so on
headers_record = next(r)['headers']

The above raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 17: invalid continuation byte.

We then tried doing it this way:

input_stream = io.TextIOWrapper(args.input_file.buffer, encoding='latin-1')
sio = io.StringIO(input_stream.read())
r = DataFileReader(sio, DatumReader())
headers_record = next(r)['headers']

This raises avro.schema.AvroException: Not an Avro data file: Obj doesn't match b'Obj\x01'.

Another way:

input_stream = io.TextIOWrapper(args.input_file.buffer, encoding='latin-1')
buf = io.BytesIO(input_stream.read().encode('latin-1'))
r = DataFileReader(buf.read(), DatumReader())
headers_record = next(r)['headers']

This raises AttributeError: 'bytes' object has no attribute 'seek'.

Upvotes: 1

Views: 2055

Answers (1)

Martijn Pieters

Reputation: 1121306

io.BytesIO() is the correct type to use to create a seekable in-memory file object containing binary data.
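
To illustrate why the type matters, here is a minimal sketch using only the standard library. The payload is made up for the example (it starts with the b'Obj\x01' magic quoted in the error message from the question but is not a valid Avro file), and the variable names are hypothetical:

```python
import io

# Made-up payload: the 4-byte magic from the error message, followed by a
# few arbitrary non-ASCII bytes. This is NOT a valid Avro data file.
data = b"Obj\x01" + bytes([0xF0, 0x9F, 0x00, 0xFF])

binary_file = io.BytesIO(data)
text_file = io.StringIO(data.decode("latin-1"))

# BytesIO yields bytes, so a comparison against a bytes magic can succeed.
magic_from_bytes = binary_file.read(4)

# StringIO yields str, and in Python 3 a str never equals a bytes object,
# which is consistent with the "Obj doesn't match b'Obj\x01'" error.
magic_from_text = text_file.read(4)

# Unlike a piped stdin, the in-memory file can be rewound.
binary_file.seek(0)
```

Both objects are seekable, but only the io.BytesIO one hands back the bytes that a binary format reader expects.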

However, you made the mistake of reading out the bytes data from your io.BytesIO() file object, and passing those in instead of the actual file object.

Don't read, pass in the actual io.BytesIO file object with the binary data read from stdin:

buf = io.BytesIO(args.input_file.buffer.read())
r = DataFileReader(buf, DatumReader())

I passed in the args.input_file.buffer data directly, assuming that args.input_file is the TextIOWrapper instance that decodes the stdin bytes, and that .buffer is the underlying BufferedReader instance providing the raw binary data. There is no point in decoding this data as Latin-1 and then encoding it as Latin-1 again. Just pass the bytes on.
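
The same idea can be checked without Avro installed. The following is only a sketch: to_seekable and ReadOnlyStream are hypothetical names introduced for this example, and ReadOnlyStream is a stand-in for a piped stdin that can be read but not seeked.

```python
import io


def to_seekable(binary_stream):
    """Read a binary stream to the end and return a seekable in-memory copy."""
    return io.BytesIO(binary_stream.read())


class ReadOnlyStream:
    """Hypothetical stand-in for piped stdin: readable, but has no seek()."""

    def __init__(self, data):
        self._data = data

    def read(self):
        # Return everything once, then behave like an exhausted stream.
        data, self._data = self._data, b""
        return data


# Every possible byte value, so the payload is certainly not valid UTF-8.
payload = bytes(range(256))

buf = to_seekable(ReadOnlyStream(payload))

# The copy supports seeking: jump to the end to measure it, then rewind.
buf.seek(0, io.SEEK_END)
size = buf.tell()
buf.seek(0)

# Decoding as Latin-1 and encoding again gives back the same bytes, so the
# round trip adds nothing over passing the bytes on directly.
round_trip = payload.decode("latin-1").encode("latin-1")
```

In the script from the question, the call would presumably be to_seekable(args.input_file.buffer), with the returned object (not the result of its .read()) handed to DataFileReader.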

Upvotes: 1
