norok2

Reputation: 26886

Determine the type of the result of `file.read()` from `file` in Python

I have some code that operates on a file object in Python.

Following Python 3's string/bytes revolution, if file was opened in binary mode, file.read() returns bytes. Conversely, if file was opened in text mode, file.read() returns str.

In my code, file.read() is called multiple times and therefore it is not practical to check for the result-type every time I call file.read(), e.g.:

def foo(file_obj):
    while True:
        data = file_obj.read(1)
        if not data:
            break
        if isinstance(data, bytes):
            # do something for bytes
            ...
        else:  # isinstance(data, str)
            # do something for str
            ...

What I would like to have instead is some way of reliably checking what the result of file.read() will be, e.g.:

def foo(file_obj):
    if is_binary_file(file_obj):
        # do something for bytes
        while True:
            data = file_obj.read(1)
            if not data:
                break
            ...
    else:
        # do something for str
        while True:
            data = file_obj.read(1)
            if not data:
                break
            ...

A possible way would be to check file_obj.mode e.g.:

import io


def is_binary_file(file_obj):
    return 'b' in file_obj.mode


print(is_binary_file(open('test_file', 'w')))
# False
print(is_binary_file(open('test_file', 'wb')))
# True
print(is_binary_file(io.StringIO('ciao')))
# AttributeError: '_io.StringIO' object has no attribute 'mode'
print(is_binary_file(io.BytesIO(b'ciao')))
# AttributeError: '_io.BytesIO' object has no attribute 'mode'

which would fail for the objects from io like io.StringIO() and io.BytesIO().


Another way, which would also work for io objects, would be to check for the encoding attribute, e.g.:

import io


def is_binary_file(file_obj):
    return not hasattr(file_obj, 'encoding')


print(is_binary_file(open('test_file', 'w')))
# False
print(is_binary_file(open('test_file', 'wb')))
# True
print(is_binary_file(io.StringIO('ciao')))
# False 
print(is_binary_file(io.BytesIO(b'ciao')))
# True

Is there a cleaner way to perform this check?

Upvotes: 3

Views: 3155

Answers (2)

norok2

Reputation: 26886

After a bit more homework, I can probably answer my own question.

First of all, a general remark: checking for the presence/absence of an attribute/method as a hallmark for the whole API is not a good idea because it will lead to more complex and still relatively unsafe code.

Following the EAFP/duck-typing mindset it may be OK to check for a specific method, but it should be the one used subsequently in the code.

The problem with file.read() (and even more so with file.write()) is that it comes with side effects that make it impractical to just try using it and see what happens.

For this specific case, while still following the duck-typing mindset, one could exploit the fact that the first parameter of read() can be set to 0. This will not actually read anything from the buffer (and it will not change the result of file.tell()), but it will give an empty str or bytes. Hence, one could write something like:

import io


def is_reading_bytes(file_obj):
    return isinstance(file_obj.read(0), bytes)


print(is_reading_bytes(open('test_file', 'r')))
# False
print(is_reading_bytes(open('test_file', 'rb')))
# True
print(is_reading_bytes(io.StringIO('ciao')))
# False 
print(is_reading_bytes(io.BytesIO(b'ciao')))
# True
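The claim that a zero-length read leaves the position untouched can be checked directly. The following is a minimal sketch using in-memory streams only; probe_read_type is a hypothetical helper name introduced here for illustration.

```python
import io


def probe_read_type(file_obj):
    """Return (type of the read(0) result, position before, position after)."""
    before = file_obj.tell()
    empty = file_obj.read(0)  # zero-length read: gives '' or b''
    after = file_obj.tell()
    return type(empty), before, after


text_stream = io.StringIO('ciao')
text_stream.read(2)  # move away from the start first
print(probe_read_type(text_stream))
# (<class 'str'>, 2, 2)

binary_stream = io.BytesIO(b'ciao')
binary_stream.read(2)
print(probe_read_type(binary_stream))
# (<class 'bytes'>, 2, 2)
```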

Similarly, one could try passing an empty bytes string b'' to the write() method:

import io


def is_writing_bytes(file_obj):
    try:
        file_obj.write(b'')
    except TypeError:
        return False
    else:
        return True


print(is_writing_bytes(open('test_file', 'w')))
# False
print(is_writing_bytes(open('test_file', 'wb')))
# True
print(is_writing_bytes(io.StringIO('ciao')))
# False 
print(is_writing_bytes(io.BytesIO(b'ciao')))
# True

Note that those methods will not check for readability / writability.
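If readability or writability matters, the probes can be guarded with the readable() and writable() methods that io.IOBase provides. This is only a sketch: read_type and write_accepts_bytes are hypothetical names, and returning None for "cannot tell" is an assumption made here, not part of the approach above.

```python
import io


def read_type(file_obj):
    """Return bytes or str for a readable stream, None if it cannot be read."""
    if not file_obj.readable():
        return None
    return type(file_obj.read(0))


def write_accepts_bytes(file_obj):
    """Return True/False for a writable stream, None if it cannot be written."""
    if not file_obj.writable():
        return None
    try:
        file_obj.write(b'')  # writes nothing, but text streams reject bytes
    except TypeError:
        return False
    return True


write_only = io.BufferedWriter(io.BytesIO())  # binary stream that is not readable
print(read_type(write_only))
# None
print(write_accepts_bytes(write_only))
# True
print(read_type(io.StringIO('ciao')))
# <class 'str'>
```

Note that readable() and writable() raise ValueError on a closed stream, which this sketch does not handle.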


Finally, one could implement a proper type-checking approach by inspecting the file-like object API. A file-like object in Python must support the API described in the io module. The documentation mentions that TextIOBase is used for files opened in text mode, while BufferedIOBase (or RawIOBase for unbuffered streams) is used for files opened in binary mode. The class hierarchy summary indicates that all of them are subclasses of IOBase. Hence the following will do the trick (remember that isinstance() checks for subclasses too):

import io


def is_binary_file(file_obj):
    return isinstance(file_obj, io.IOBase) and not isinstance(file_obj, io.TextIOBase)


print(is_binary_file(open('test_file', 'w')))
# False
print(is_binary_file(open('test_file', 'wb')))
# True
print(is_binary_file(open('test_file', 'r')))
# False
print(is_binary_file(open('test_file', 'rb')))
# True
print(is_binary_file(io.StringIO('ciao')))
# False 
print(is_binary_file(io.BytesIO(b'ciao')))
# True
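With such a check, the type decision from the question can be made once, before the read loop. The sketch below assumes two hypothetical per-unit handlers (bytes are turned into integers, str characters are kept as they are) purely to make the two branches visible; read_units is an illustrative name.

```python
import io


def is_binary_file(file_obj):
    return isinstance(file_obj, io.IOBase) and not isinstance(file_obj, io.TextIOBase)


def read_units(file_obj):
    """Read one unit at a time, choosing the handler once up front."""
    if is_binary_file(file_obj):
        handle = lambda data: data[0]  # hypothetical: byte -> int
    else:
        handle = lambda data: data     # hypothetical: keep the character
    result = []
    while True:
        data = file_obj.read(1)
        if not data:
            break
        result.append(handle(data))
    return result


print(read_units(io.BytesIO(b'ciao')))
# [99, 105, 97, 111]
print(read_units(io.StringIO('ciao')))
# ['c', 'i', 'a', 'o']
```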

Note that the documentation explicitly says that TextIOBase will have an encoding attribute, which is not required (i.e. it is not there) for binary file objects. Hence, with the current API, checking for the encoding attribute may be a handy hack to check whether the file object is binary for standard classes, under the assumption that the tested object is file-like. Checking the mode attribute would only work for FileIO objects: the mode attribute is not part of the IOBase / RawIOBase interface, and that is why it does not work on io.StringIO() / io.BytesIO() objects.

Upvotes: 0

Iguananaut

Reputation: 23296

I have a version of this in astropy (for Python 3, though a Python 2 version can be found in older versions of Astropy if needed for some reason).

It's not pretty, but it works reliably enough for most cases (I took out the part that checks for a .binary attribute since that's only applicable to a class in Astropy):

import gzip
import io


def fileobj_is_binary(f):
    """
    Returns True if the given file or file-like object has a file open in binary
    mode.  When in doubt, returns True by default.
    """

    if isinstance(f, io.TextIOBase):
        return False

    mode = fileobj_mode(f)
    if mode:
        return 'b' in mode
    else:
        return True

where fileobj_mode is:

def fileobj_mode(f):
    """
    Returns the 'mode' string of a file-like object if such a thing exists.
    Otherwise returns None.
    """

    # Go from most to least specific--for example gzip objects have a 'mode'
    # attribute, but it's not analogous to the file.mode attribute

    # gzip.GzipFile -like
    if hasattr(f, 'fileobj') and hasattr(f.fileobj, 'mode'):
        fileobj = f.fileobj

    # astropy.io.fits._File -like, doesn't need additional checks because it's
    # already validated
    elif hasattr(f, 'fileobj_mode'):
        return f.fileobj_mode

    # PIL-Image -like investigate the fp (filebuffer)
    elif hasattr(f, 'fp') and hasattr(f.fp, 'mode'):
        fileobj = f.fp

    # FILEIO -like (normal open(...)), keep as is.
    elif hasattr(f, 'mode'):
        fileobj = f

    # Doesn't look like a file-like object, for example strings, urls or paths.
    else:
        return None

    return _fileobj_normalize_mode(fileobj)


def _fileobj_normalize_mode(f):
    """Takes care of some corner cases in Python where the mode string
    is either oddly formatted or does not truly represent the file mode.
    """
    mode = f.mode

    # Special case: Gzip modes:
    if isinstance(f, gzip.GzipFile):
        # GzipFiles can be either readonly or writeonly
        if mode == gzip.READ:
            return 'rb'
        elif mode == gzip.WRITE:
            return 'wb'
        else:
            return None  # This shouldn't happen?

    # Sometimes Python can produce modes like 'r+b' which will be normalized
    # here to 'rb+'
    if '+' in mode:
        mode = mode.replace('+', '')
        mode += '+'

    return mode

You might also want to add a special case for io.BytesIO. Again, ugly, but works for most cases. Would be great if there were a simpler way.
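One possible shape for such a special case is sketched below. As shown above, io.BytesIO already ends up as True through the "when in doubt" default, because it has none of the attributes fileobj_mode looks for; the sketch only makes that outcome explicit. The wrapper name and its fallback parameter are hypothetical, and fallback stands in for a general check such as fileobj_is_binary.

```python
import io


def fileobj_is_binary_with_bytesio(f, fallback):
    """Settle in-memory streams explicitly, then defer to a general check.

    `fallback` is an assumed callable taking the file object and returning
    a bool, e.g. the fileobj_is_binary function from this answer.
    """
    if isinstance(f, io.BytesIO):
        return True
    if isinstance(f, io.StringIO):
        return False
    return fallback(f)


# Stand-in fallback that mimics "when in doubt, return True".
print(fileobj_is_binary_with_bytesio(io.BytesIO(b'ciao'), lambda f: True))
# True
print(fileobj_is_binary_with_bytesio(io.StringIO('ciao'), lambda f: True))
# False
```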

Upvotes: 2
