Reputation: 1224

Increasing understanding of the Python read() function

I have the following Python 3 script:

from sys import argv

script, filename = argv

txt = open(filename)

print(f"Here's your file {filename}:")

print(txt.read())

When we use the built-in function open(), it opens the file and returns a corresponding file object.

I understand that read() is not a built-in function, but a method of a file object.

As stated here in the Python docs about file objects https://docs.python.org/3/glossary.html#term-file-object:

There are actually three categories of file objects: raw binary files, buffered binary files and text files. Their interfaces are defined in the io module.

I'm really struggling to understand a few key areas.

1) How do I know which file object type I will be working with of raw binary, buffered binary and text files? In this example I am using a simple .txt file, so I would assume the file object would be a text file.

2) How do I know which specific read() method I am calling when I use the io module? Which class is it part of, given that multiple classes have a read method available?

Please keep answers as simple as possible as I'm fairly new to Python. I just don't understand the documentation for the io module very well. I quickly become lost from step 3 onwards and need this explaining to me in simple steps.

I'm making a real effort to understand the logical steps to navigate the documentation, so please amend these steps as appropriate.

My understanding is as follows:

1. We call the built-in open() function.
2. This opens a file and returns a corresponding file object.
3. We then use the io module to work with the file object.
4. Establish what category of file object we are using; in this case I believe it is Text I/O.
5. Text I/O states 'The text stream API is described in detail in the documentation of TextIOBase.'
6. The class io.TextIOBase is used, which has various methods such as read() available.

Upvotes: 0

Answers (4)

bruno desthuilliers

Reputation: 77902

"In this example I am using a simple .txt file, so I would assume the file object would be a text file."

This is totally unrelated.

The extension is only a naming convention. It has absolutely nothing to do with the actual content, which from a purely technical point of view is always made of bytes anyway (the difference is in how you interpret those bytes). It has nothing to do with which IO class open() will use either; see deceze's complete and excellent answer.
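As a small sketch of this point (the file name notes.jpg and its contents are made up for illustration), the same file can be read as text or as bytes regardless of its extension. What changes the result is the mode passed to open():

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file: a ".jpg" name on a file that really holds UTF-8 text.
    path = os.path.join(folder, "notes.jpg")
    with open(path, "wb") as f:
        f.write("café\n".encode("utf-8"))

    # Text mode: the bytes are decoded, so read() returns a str.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: nothing is decoded, so read() returns bytes.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(type(as_text).__name__, repr(as_text))
print(type(as_bytes).__name__, repr(as_bytes))
```

The extension never enters into it: open() only looks at the arguments you give it.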

Upvotes: 0

Jens

Reputation: 9130

How do I know which file object type I will be working with of raw binary, buffered binary and text files? In this example I am using a simple .txt file, so I would assume the file object would be a text file.

You don’t. But there are ways to identify/guess a file’s content type quite similar to Linux’s file command. For example, take a look at the python-magic package:

import magic  # third-party package: python-magic

m = magic.Magic(mime=True)  # report MIME types
print(m.from_file(filename))  # filename: path of the file to inspect

This would give you the MIME type of a file, e.g. application/json and then you’d know whether to read it as a text or binary file.

Whether a text or binary file is read buffered or unbuffered depends on how you open it; see also the io module.

The other answers provide more details on the IO, so I’m not going into this here… 😉

Upvotes: 0

PEdroArthur

Reputation: 884

It is all about how you open the file.

If you call open(path), you get a text file object. If you call open(path, 'rb'), you get a buffered binary file object. If you call open(path, 'rb', buffering=0), you get an unbuffered (raw) binary file object. Simple as that =)
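One way to check this yourself is to ask each file object for its class. The sketch below assumes CPython and uses a throwaway file; the helper name file_object_class is made up for this example:

```python
import os
import tempfile

def file_object_class(path, *args, **kwargs):
    """Open path with the given open() arguments and return the class name."""
    with open(path, *args, **kwargs) as f:
        return type(f).__name__

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")  # hypothetical file
    with open(path, "w") as f:
        f.write("hello\n")

    text_class = file_object_class(path)                        # text
    buffered_class = file_object_class(path, "rb")              # buffered binary
    raw_class = file_object_class(path, "rb", buffering=0)      # raw binary

print(text_class)
print(buffered_class)
print(raw_class)
```

On CPython this prints TextIOWrapper, BufferedReader and FileIO, which are the classes to look up in the io documentation for each case.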

Please refer to https://docs.python.org/3/library/io.html for more information.

Upvotes: 0

deceze

Reputation: 522075

Certain things are common to all file objects, and you can see that in the class hierarchy. All of the file objects have IOBase as their base class, which defines the things they share. It then specialises into the RawIOBase, BufferedIOBase and TextIOBase classes, which in turn specialise into FileIO, BytesIO and so on. It's a typical OOP class hierarchy.
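The hierarchy can be checked from Python itself. This is only an illustrative sketch; the list of pairs is my own selection of classes from the io module, not an exhaustive one:

```python
import io

# (concrete class, category base class) pairs from the io module
pairs = [
    (io.FileIO, io.RawIOBase),
    (io.BufferedReader, io.BufferedIOBase),
    (io.BytesIO, io.BufferedIOBase),
    (io.TextIOWrapper, io.TextIOBase),
    (io.StringIO, io.TextIOBase),
]

for concrete, base in pairs:
    # Each concrete class belongs to one category and, through it, to IOBase.
    print(
        concrete.__name__, "->", base.__name__,
        issubclass(concrete, base),
        issubclass(concrete, io.IOBase),
    )
```

Every line should end in True True, showing that each concrete class sits under one category class and under IOBase.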

What they all have in common is that they all define a read method. What that method does differs slightly in the details, but the overall function is the same: it reads from whatever the underlying data is and returns that data. That's typical OOP abstraction/encapsulation/polymorphism: you don't need to care how it does it or what exactly it does, you just need to know that you call .read() to get data.
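A minimal sketch of that idea, using in-memory streams so no real file is needed (read_all is a made-up helper name):

```python
import io

def read_all(stream):
    """Return everything the stream holds, whatever kind of stream it is."""
    return stream.read()

# The same call works on a text stream and on a binary stream.
from_text = read_all(io.StringIO("hello"))
from_bytes = read_all(io.BytesIO(b"hello"))

print(repr(from_text))
print(repr(from_bytes))
```

The function never asks which class it was given. The only visible difference is the type of the result: str from the text stream, bytes from the binary one.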

You could instantiate those classes individually, but you typically use open to simplify that potentially complex task. open decides which class to return to you based on what exactly you requested:

Text I/O

Text I/O expects and produces str objects. This means that whenever the backing store is natively made of bytes (such as in the case of a file), encoding and decoding of data is made transparently as well as optional translation of platform-specific newline characters.

The easiest way to create a text stream is with open(), optionally specifying an encoding:
f = open("myfile.txt", "r", encoding="utf-8")

Binary I/O

Binary I/O (also called buffered I/O) expects bytes-like objects and produces bytes objects. No encoding, decoding, or newline translation is performed. [...]

The easiest way to create a binary stream is with open() with 'b' in the mode string:
f = open("myfile.jpg", "rb")

Raw I/O

Raw I/O (also called unbuffered I/O) is generally used as a low-level building-block for binary and text streams; it is rarely useful to directly manipulate a raw stream from user code. Nevertheless, you can create a raw stream by opening a file in binary mode with buffering disabled:
f = open("myfile.jpg", "rb", buffering=0)
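To see how raw I/O serves as a building block, you can peel back the layers of an ordinary text file object. This sketch assumes CPython and a regular file on disk; the file name is only an example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "myfile.txt")  # hypothetical file
    with open(path, "w", encoding="utf-8") as f:
        f.write("hi\n")

    with open(path, "r", encoding="utf-8") as text_stream:
        layers = [
            type(text_stream).__name__,             # text layer (decodes to str)
            type(text_stream.buffer).__name__,      # buffered binary layer
            type(text_stream.buffer.raw).__name__,  # raw layer talking to the OS
        ]

print(" -> ".join(layers))
```

On CPython this prints TextIOWrapper -> BufferedReader -> FileIO: the text object wraps a buffered binary object, which in turn wraps a raw one.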

Upvotes: 3
