ragingasiancoder

Reputation: 600

How does readline() work behind the scenes when reading a text file?

I would like to understand how readline() takes in a single line from a text file. The specific details I would like to know about, with respect to how the compiler interprets the Python language and how this is handled by the CPU, are:

  1. How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?
  2. Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?

I am a "beginner" (I have about 4 years of "simpler" programming experience), so I wouldn't be able to understand technical details, but feel free to expand if it could help others understand!

Upvotes: 5

Views: 5898

Answers (2)

xgord

Reputation: 4806

Example using the file file.txt:

fake file
with some text
in a few lines

Question 1: How does the readline() know which line of text to read, given that successive calls to readline() read the text line by line?

When you open a file in Python, you get a file object. A file object wraps an operating-system file descriptor and keeps track of a current position in the file. When you first open a file for reading, that position is at the beginning of the file. Each call to readline() returns the text from the current position up to and including the next newline, and leaves the position at the character just after that newline.

Calling the tell() method of a file object returns the position the file object is currently at.

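with open('file.txt', 'r') as fd:
    print(fd.tell())   # position before anything has been read
    fd.readline()      # reads "fake file" plus its line ending
    print(fd.tell())   # position just after the first line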

# output:
0
10
# Or 11, depending on the line separators in the file
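
To make the "position" idea concrete, here is a rough sketch of a line reader built directly on an OS file descriptor. TinyLineReader, chunk_size and demo are names made up for illustration, and this is not how CPython actually implements readline(); it only shows the general idea of fetching chunks, buffering them, and tracking how far the caller has read.

```python
import os
import tempfile


class TinyLineReader:
    """Hypothetical, simplified reader; NOT CPython's real readline()."""

    def __init__(self, path, chunk_size=4):
        # O_BINARY only exists on Windows; it disables newline translation.
        flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)
        self.fd = os.open(path, flags)  # raw OS-level file descriptor
        self.chunk_size = chunk_size
        self.buffer = b""  # bytes fetched from the OS but not yet returned
        self.pos = 0       # number of bytes already handed to the caller

    def readline(self):
        # Keep fetching chunks until the buffer holds a newline or the file ends.
        while b"\n" not in self.buffer:
            chunk = os.read(self.fd, self.chunk_size)
            if not chunk:  # end of file
                break
            self.buffer += chunk
        newline_at = self.buffer.find(b"\n")
        end = len(self.buffer) if newline_at == -1 else newline_at + 1
        line, self.buffer = self.buffer[:end], self.buffer[end:]
        self.pos += len(line)
        return line

    def tell(self):
        return self.pos

    def close(self):
        os.close(self.fd)


def demo():
    """Read the example text line by line, recording the position after each line."""
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"fake file\nwith some text\nin a few lines\n")
    reader = TinyLineReader(tmp.name)
    results = []
    try:
        while True:
            line = reader.readline()
            if not line:
                break
            results.append((line, reader.tell()))
    finally:
        reader.close()
        os.remove(tmp.name)
    return results


for line, position in demo():
    print(position, line)
```

The reader never needs to know "which line" it is on. It only remembers how many bytes it has already returned, and the next call continues from there.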


Question 2: Is there a way to start reading a line of text from the middle of a text? How would this work with respect to the CPU?

First off, reading a file isn't really a question about the CPU. It is handled by the operating system and the file system, which together determine how files can be read and written.

For random access in files, you can use Python's mmap module. The Python Module of the Week site has a great tutorial.

Example, jumping to the 2nd line in the example file and reading until the end:

import mmap
import contextlib

with open('file.txt', 'rb') as fd:
    with contextlib.closing(mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)) as mm:
        # Offset 10 is the start of line 2, assuming "\n" line endings
        print(mm[10:].decode())

# output:
with some text
in a few lines
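
If mapping the whole file is more than you need, the file object's own seek() method can also move the position before a read. The following is a minimal sketch, not part of the original example: make_sample_file and readline_at are hypothetical helper names, and the file is opened in binary mode so that offsets are plain byte counts.

```python
import os
import tempfile


def make_sample_file():
    """Write the example text to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"fake file\nwith some text\nin a few lines\n")
    return tmp.name


def readline_at(path, offset):
    """Move to byte `offset`, then read up to and including the next newline."""
    with open(path, "rb") as f:
        f.seek(offset)  # changes the position without reading anything
        return f.readline()


path = make_sample_file()
try:
    print(readline_at(path, 10))  # starts exactly at the second line
    print(readline_at(path, 15))  # starts in the middle of the second line
finally:
    os.remove(path)
```

Starting at offset 15 returns only the tail of the second line (b'some text\n'), because readline() simply reads from wherever the position happens to be until the next newline.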

Upvotes: 5

Fabian Fagerholm

Reputation: 4139

This is a very broad question and it's unlikely that all details about what the CPU does would fit in an answer. But a high-level answer is possible:

  1. readline reads each line in order. It starts by reading chunks of the file from the beginning. When it encounters a line break, it returns that line. Each successive invocation of readline returns the next line until the last line has been read. Then it returns an empty string.

    with open("myfile.txt") as f:
        while True:
            line = f.readline()
            if not line:
                break
            # do something with the line
    

    readline() uses operating system calls under the hood. The file object corresponds to a file descriptor in the OS, and it has a pointer that keeps track of where in the file we currently are. The next read returns the next chunk of data from the file, starting at that point.

  2. You would have to scan through the file first in order to know how many lines there are, and then use some way of starting from the "middle" line. If you meant some arbitrary line other than the first and last lines, you would have to scan the file from the beginning, identifying lines (for example, by repeatedly calling readline and throwing away the result) until you have reached the line you want. There is a ready-made module for this: linecache.

    import linecache
    
    linecache.getline("myfile.txt", 5) # we already know we want line 5
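
The "call readline repeatedly and throw away the result" approach from point 2 could look roughly like the sketch below. get_line is a hypothetical helper written for illustration, not something from the standard library. It uses 1-based line numbers and returns an empty string when the line does not exist, which is also what linecache.getline does.

```python
import os
import tempfile


def get_line(path, line_number):
    """Return line `line_number` (1-based) by scanning from the start of the file."""
    if line_number < 1:
        return ""
    with open(path) as f:
        # Read and discard every line before the one we want.
        for _ in range(line_number - 1):
            if not f.readline():  # reached end of file too early
                return ""
        return f.readline()


# Small demonstration on a temporary copy of the example text.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("fake file\nwith some text\nin a few lines\n")
try:
    print(repr(get_line(tmp.name, 2)))
    print(repr(get_line(tmp.name, 5)))
finally:
    os.remove(tmp.name)
```

There is no shortcut here for ordinary text files: lines have different lengths, so the only way to find where line N starts is to read past the N-1 lines before it.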
    

Upvotes: 2
