Reputation: 2656

Python bool compare vs string compare, which is faster?

I need to process a very large CSV file.

During processing, the first line needs some special attention.

So the obvious approach is to check for the value in each CSV line, but that means a string comparison for every line (around 200,000).

Another option would be to set a boolean and let the boolean compare come first in an 'or' expression.

Both options are below:

import csv

def do_extra_processing():
    pass


def do_normal_processing():
    pass


if __name__ == "__main__":

    with open('file.csv', newline='') as csvfile:
        lines = csv.reader(csvfile, delimiter=';')

        line_checked: bool = False

        for line in lines:
            # Check the first line: Option 1
            if line[1] == "SomeValue":
                # Every line of the 200000 lines does the string-compare
                do_extra_processing()
            do_normal_processing()

            # Check the first line: Option 2
            if (line_checked) or (line[1] == "SomeValue"):
                # Every line of the 200000 lines does the boolean-compare first and does not evaluate the string compare
                do_extra_processing()
                line_checked = True
            do_normal_processing()

I've checked that in an 'or' expression, the second part is not evaluated when the first part is True.
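
For reference, a minimal sketch of that check. The helper `expensive_compare` is a hypothetical stand-in for the string comparison; it only records that it was called.

```python
def expensive_compare(calls):
    # Hypothetical stand-in for the string comparison; records each call.
    calls.append(1)
    return True


calls = []

flag = True
result = flag or expensive_compare(calls)    # right-hand side is skipped
skipped = len(calls) == 0

flag = False
result2 = flag or expensive_compare(calls)   # right-hand side runs now
ran_once = len(calls) == 1

print(skipped, ran_once)
```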

The boolean is initialized just above the for-loop and set in the if-statement when the extra_processing is done.

The question is: Is the second option with the bool-compare significantly faster?

(No need to convert to , so different question than 37615264 )

Upvotes: 1

Answers (3)

Kelly Bundy

Reputation: 27609

(Edit/note: This applies to what I think the OP's code is intended to do, not what it actually does. I've asked whether it's a bug like I suspect.)

What the original version does:

Load line.
Load 1.
Load line[1].
Load a string constant.
Do a string comparison, resulting in a bool.
Check the truth of a bool.

What the bool-optimized version does:

Load line_checked.
Check the truth of a bool.

Which is faster? Take a guess :-). Better still, measure: you might find that neither matters, i.e., that both are much faster than the remaining actual processing per line.
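
A rough way to see the two step lists above on your own interpreter is the dis module. This is only a sketch, assuming CPython; the function names string_check and bool_check are made up for illustration, and the exact opcode names differ between Python versions.

```python
import dis


def string_check(line):
    # Original version: index the row and compare against a string constant.
    if line[1] == "SomeValue":
        return True
    return False


def bool_check(line_checked):
    # Bool version: only the truth value of a local variable is tested.
    if line_checked:
        return True
    return False


def opnames(func):
    # Collect the names of the bytecode instructions generated for func.
    return [ins.opname for ins in dis.get_instructions(func)]


string_ops = opnames(string_check)
bool_ops = opnames(bool_check)
print(len(string_ops), string_ops)
print(len(bool_ops), bool_ops)
```

Fewer instructions usually means less work, but it is not a timing, so it does not replace a measurement.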

Anyway, here are two ideas that need no extra work for the lines after the first:

Separate code:

    with open('file.csv', newline='') as csvfile:
        lines = csv.reader(csvfile, delimiter=';')

        for line in lines:
            if line[1] == "SomeValue":
                do_extra_processing()
            do_normal_processing()
            break

        for line in lines:
            do_normal_processing()
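
A variant of the same idea, not the code above: pull the first row off the reader with next() instead of a for loop with break. This is a sketch under assumptions; the in-memory sample data and the two counters are hypothetical and only there to make it self-contained.

```python
import csv
import io

# Hypothetical in-memory data standing in for file.csv.
sample = io.StringIO("a;SomeValue;c\nd;e;f\ng;h;i\n")

extra_calls = 0
normal_calls = 0


def do_extra_processing():
    global extra_calls
    extra_calls += 1


def do_normal_processing():
    global normal_calls
    normal_calls += 1


lines = csv.reader(sample, delimiter=';')

# Handle the first row on its own; None means the input was empty.
first = next(lines, None)
if first is not None:
    if first[1] == "SomeValue":
        do_extra_processing()
    do_normal_processing()

# The reader is an iterator, so this loop resumes at the second row.
for line in lines:
    do_normal_processing()

print(extra_calls, normal_calls)
```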

Switch the processing function after the first line:

    with open('file.csv', newline='') as csvfile:
        lines = csv.reader(csvfile, delimiter=';')

        def process():
            if line[1] == "SomeValue":
                do_extra_processing()
            do_normal_processing()
            nonlocal process
            process = do_normal_processing
            
        for line in lines:
            process()

Not tested. The latter solution might need global instead of nonlocal if you keep that code block in the global space. Might be a good idea to put it in a function, though.
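
Here is a sketch of the function-switching idea wrapped in a function so that nonlocal is valid. It deviates from the snippet above in a few assumed details: the row is passed as an argument, a log list is threaded through the processing functions to make the effect visible, and process_all, process_first and process_rest are names made up for illustration.

```python
import csv
import io


def do_extra_processing(log):
    log.append("extra")


def do_normal_processing(log):
    log.append("normal")


def process_all(csvfile):
    log = []
    lines = csv.reader(csvfile, delimiter=';')

    def process_first(line):
        # Runs for the first row only, then swaps itself out.
        nonlocal process
        if line[1] == "SomeValue":
            do_extra_processing(log)
        do_normal_processing(log)
        process = process_rest

    def process_rest(line):
        # Runs for every later row: no comparison, no flag check.
        do_normal_processing(log)

    process = process_first
    for line in lines:
        process(line)
    return log


# Hypothetical sample data; the second row also contains "SomeValue"
# but gets no extra processing because it is not the first row.
result = process_all(io.StringIO("a;SomeValue;c\nd;SomeValue;f\ng;h;i\n"))
print(result)
```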

A little benchmark: If you have a bug as I suspect, and the bool is intended to avoid the string comparison and extra processing for all but the first line, then I get times like these:

11.5 ms  11.6 ms  11.6 ms  if is_first_line and line[1] == "Somevalue": doesnt_happen_in_other_lines
45.1 ms  45.3 ms  45.3 ms  if line[1] == "Somevalue": doesnt_happen_in_other_lines

Code:

from timeit import repeat

setup = '''
is_first_line = False
line = [None, "Othervalue"]
'''

statements = [
    'if is_first_line and line[1] == "Somevalue": doesnt_happen_in_other_lines',
    'if line[1] == "Somevalue": doesnt_happen_in_other_lines',
]

for _ in range(3):
    for stmt in statements:
        ts = sorted(repeat(stmt, setup))[:3]
        print(*('%4.1f ms ' % (t * 1e3) for t in ts), stmt)
    print()

Upvotes: 3

Serge Ballesta

Reputation: 148965

Before running any tests, I would have advised using the second version, because we all know that testing a boolean is simpler than testing string equality.

Then I did what I advised @AidenEllis to do (Python 3.10 on Windows), and was kind of amazed:

timeit('x = "foo" if a == b else "bar"', '''a=True
b=False
''')
0.031938999999511
timeit('x = "foo" if a == b else "bar"', '''a=True
b=True
''')
0.032499900000402704
timeit('x = "foo" if a == b else "bar"', '''a="Somevalue"
b="Somevalue1"
''')
0.03237569999964762

Nothing really significant...

Then I tried:

timeit('x = "foo" if a else "bar"', 'a=True')
0.022047000000384287
timeit('x = "foo" if a else "bar"', 'a=False')
0.020898400000078254

Close to 30% faster, looks good...

And finally:

timeit('x = "foo" if (a or (b == c)) else "bar"', '''a=True
b="Somevalue"
c="Somevalue1"
''')
0.022851300000183983

Still a significant gain, but it means that testing the truth of a boolean is faster than comparing two values whatever their type, even when both are booleans. Not really what I expected...

My conclusion is that we are playing with implementation details (which is why I gave the Python version), and that the only sensible answer is that it does not really matter: the gain, if any, should be negligible compared to the real processing time.
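
To repeat these measurements on another Python version or platform, something like the following sketch should do. It reuses the statements from above; the only change is an explicit number=200_000 (instead of timeit's default of 1,000,000) to keep the run short, so the absolute figures will not match the ones I posted.

```python
from timeit import timeit

# (statement, setup) pairs taken from the measurements above.
cases = [
    ('x = "foo" if a == b else "bar"', 'a=True; b=False'),
    ('x = "foo" if a == b else "bar"', 'a="Somevalue"; b="Somevalue1"'),
    ('x = "foo" if a else "bar"', 'a=True'),
    ('x = "foo" if (a or (b == c)) else "bar"',
     'a=True; b="Somevalue"; c="Somevalue1"'),
]

results = []
for stmt, setup in cases:
    # Total time in seconds for 200,000 executions of stmt.
    seconds = timeit(stmt, setup, number=200_000)
    results.append(seconds)
    print(f"{seconds:.4f} s  {stmt}  [{setup}]")
```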

Upvotes: 1

Aiden Ellis

Reputation: 56

This might not be the quick answer you're looking for, but why not just time both: run each version on its own and check which one finishes faster?

Use this if you just want to quickly compare two different pieces of code:

import time

start = time.perf_counter()

# do your processes here

finish = time.perf_counter()
total = finish - start
print(f"Process Time: {round(total * 1000, 2)}ms")
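
As a sketch of how that could look for this question, assuming the flag is meant to skip the string comparison after the first line: the rows list is synthetic data standing in for the parsed CSV, and option_string and option_flag are made-up names.

```python
import time

# Synthetic stand-in for the parsed CSV: 200,000 rows, only the first matches.
rows = [["x", "SomeValue"]] + [["x", "Othervalue"]] * 199_999


def option_string(rows):
    # String comparison on every row.
    extra = 0
    for line in rows:
        if line[1] == "SomeValue":
            extra += 1
    return extra


def option_flag(rows):
    # Flag check on every row, string comparison on the first row only.
    extra = 0
    is_first_line = True
    for line in rows:
        if is_first_line:
            if line[1] == "SomeValue":
                extra += 1
            is_first_line = False
    return extra


for func in (option_string, option_flag):
    start = time.perf_counter()
    extra = func(rows)
    finish = time.perf_counter()
    total = finish - start
    print(f"{func.__name__}: {round(total * 1000, 2)}ms (extra={extra})")
```

Run it a few times; a single measurement like this is noisy.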

aight, back to it XD

Upvotes: 0
