Persistent variable in Python function

Question

I've got this script written which does some involved parsing while running through a large file. For every line in the file (after some heavy manipulation), I need to add a check to see whether it meets certain criteria and if it does, include it in a list for additional processing later.

The function doing the parsing is already a bit cluttered, and I'm wondering whether it's possible to shunt the line-checking and list manipulation to another function to make things easier to modify later on. My inclination is to use a global variable which gets modified by the function, but I know that's usually poor form. Until now I have never made use of classes, but I vaguely recall there being some advantage to them regarding persistent local variables.

One version of this part of the script could be:

matchingLines = []
def lineProcess(line):
    global matchingLines
    if line.startswith(criteria):
        matchingLines.append(line)
for line in myFile:
    # lots of other stuff
    lineProcess(line)

Obviously in this simple example it's not much of a pain to just do the checking in the main function and not bother with any additional functions. But in principle, I'm wondering what a better general way of doing this sort of thing is without using external variables.

EDIT: Part of the reason I find the separate function attractive is because once I've collected the list of lines, I am going to use them to manipulate another external file and it would be convenient to have the whole operate wrapped up in a contained module. However, I recognize that this might be premature optimization.

Tom Dalton · Accepted Answer

Something like the following might be considered more pythonic:

def is_valid_line(line):
    """Return True if the line is valid, else False."""
    return line.startswith(criteria)

valid_lines = [l for l in myFile if is_valid_line(l)]

Incidentally it would be even better practice to use a generator expression rather than a list e.g.

valid_lines = (l for l in myFile if is_valid_line(l))

That way the file reading and line-validation will only actually happen when something tries to iterate over valid_lines, and not before. E.g. In the following case:

valid_lines = [l for l in myFile if is_valid_line(l)]
for line in valid lines:
    stuff_that_can_raise_exception(line)

In this case, you have read and validated the entire (huge) file and have the full list of validated lines, and then the first line causes an error, the time spent validating the whole file is wasted. If you use a generator expression (the (x for x in y)) version instead of a list comprehension (the [x for x in y]) then when the error happens you haven't actually validated the file yet (only the first line). I only mention it because I am terrible for not doing this more often myself, and it can yield big performance gains (in CPU and memory) in a lot of cases.

Persistent variable in Python function

Answers (2)

Related Questions