Reputation: 2656
I need to process a very large CSV file.
During the process the first line needs some special attention.
The obvious approach is to check for the value in each CSV line, but that means a string comparison for every line (around 200,000).
Another option would be to set a boolean and let the boolean compare come first in an 'or' expression.
Both options are below:
import csv

def do_extra_processing():
    pass

def do_normal_processing():
    pass

if __name__ == "__main__":
    with open('file.csv', newline='') as csvfile:
        lines = csv.reader(csvfile, delimiter=';')
        line_checked: bool = False
        for line in lines:
            # Check the first line: Option 1
            if line[1] == "SomeValue":
                # Every line of the 200000 lines does the string-compare
                do_extra_processing()
            do_normal_processing()

            # Check the first line: Option 2
            if (line_checked) or (line[1] == "SomeValue"):
                # Every line of the 200000 lines does the boolean-compare first and does not evaluate the string compare
                do_extra_processing()
                line_checked = True
            do_normal_processing()
I've checked that in an 'or' expression, the second part is not evaluated when the first part is True.
The boolean is initialized just above the for-loop and set in the if-statement when the extra_processing is done.
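The short-circuit behaviour of 'or' described above can be checked with a small sketch like the following. The helper expensive_compare and the calls list are hypothetical names used only for illustration; they are not part of the original code.

```python
calls = []  # records every time the right-hand side is actually evaluated

def expensive_compare(value):
    """Hypothetical stand-in for the string comparison; logs each call."""
    calls.append(value)
    return value == "SomeValue"

# Left operand is True: 'or' returns it without touching the right side.
line_checked = True
result_true = line_checked or expensive_compare("SomeValue")
calls_after_true = len(calls)

# Left operand is False: the right side has to be evaluated.
line_checked = False
result_false = line_checked or expensive_compare("OtherValue")
calls_after_false = len(calls)

print(result_true, calls_after_true)
print(result_false, calls_after_false)
```

Note that this only shows which operand is evaluated; it says nothing about whether the saving is measurable.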
The question is: Is the second option with the bool-compare significantly faster?
(No need to convert to , so different question than 37615264 )
Upvotes: 1
Views: 643
Reputation: 27609
(Edit/note: This applies to what I think the OP's code is intended to do, not what it actually does. I've asked whether it's a bug like I suspect.)
What the original version does for each line: load line, load the constant 1, index to get line[1], then compare that value with the string "SomeValue".
What the bool-optimized version does: load line_checked and test its truth value.
Which is faster? Take a guess :-). But better still, measure: you might find that neither matters, i.e., that both are much faster than the remaining actual processing per line.
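One way to look at the work each condition does is to disassemble it with the standard dis module. This is only a sketch: the exact instruction names and counts differ between CPython versions, and the "<option 1>" and "<option 2>" labels are made up for the printout.

```python
import dis

# Compile each condition as a standalone expression.
string_check = compile('line[1] == "SomeValue"', "<option 1>", "eval")
bool_check = compile("line_checked", "<option 2>", "eval")

for code in (string_check, bool_check):
    print(code.co_filename)
    dis.dis(code)  # prints the bytecode for this CPython version

# Count instructions as a rough proxy for the work done per evaluation.
n_string = len(list(dis.get_instructions(string_check)))
n_bool = len(list(dis.get_instructions(bool_check)))
print(n_string, n_bool)
```

Fewer instructions usually means less work, but it is not a timing; an actual measurement is still needed.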
Anyway, here are two ideas that need no extra work for the lines after the first:
with open('file.csv', newline='') as csvfile:
    lines = csv.reader(csvfile, delimiter=';')
    for line in lines:
        if line[1] == "SomeValue":
            do_extra_processing()
        do_normal_processing()
        break
    for line in lines:
        do_normal_processing()

with open('file.csv', newline='') as csvfile:
    lines = csv.reader(csvfile, delimiter=';')
    def process():
        if line[1] == "SomeValue":
            do_extra_processing()
        do_normal_processing()
        nonlocal process
        process = do_normal_processing
    for line in lines:
        process()
Not tested. The latter solution might need global instead of nonlocal if you keep that code block in the global space. Might be a good idea to put it in a function, though.
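As a sketch of the "put it in a function" suggestion, the self-replacing handler could look like this. Several things here are assumptions for the sake of a runnable example: the name process_file, passing line as an argument, the calls list that records what ran, and the in-memory sample data that stands in for file.csv.

```python
import csv
import io

def process_file(csvfile):
    """Process all rows; only the first row pays for the string comparison."""
    calls = []  # hypothetical log of which handler ran, for illustration

    def do_extra_processing(line):
        calls.append(("extra", line))

    def do_normal_processing(line):
        calls.append(("normal", line))

    def first(line):
        # nonlocal works here because 'process' is bound in process_file.
        nonlocal process
        if line[1] == "SomeValue":
            do_extra_processing(line)
        do_normal_processing(line)
        process = do_normal_processing  # later rows skip the check entirely

    process = first
    for line in csv.reader(csvfile, delimiter=';'):
        process(line)
    return calls

# In-memory stand-in for the real file; the third row shows that a later
# "SomeValue" no longer triggers the extra processing.
sample = io.StringIO("a;SomeValue\nb;x\nc;SomeValue\n")
result = process_file(sample)
for kind, line in result:
    print(kind, line)
```

The loop looks up process on every iteration, so rebinding it inside first takes effect from the second row onward.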
A little benchmark: If you have a bug as I suspect, and the bool is intended to avoid the string comparison and extra processing for all but the first line, then I get times like these:
11.5 ms 11.6 ms 11.6 ms if is_first_line and line[1] == "Somevalue": doesnt_happen_in_other_lines
45.1 ms 45.3 ms 45.3 ms if line[1] == "Somevalue": doesnt_happen_in_other_lines
Code (Try it online!):
from timeit import repeat

setup = '''
is_first_line = False
line = [None, "Othervalue"]
'''

statements = [
    'if is_first_line and line[1] == "Somevalue": doesnt_happen_in_other_lines',
    'if line[1] == "Somevalue": doesnt_happen_in_other_lines',
]

for _ in range(3):
    for stmt in statements:
        ts = sorted(repeat(stmt, setup))[:3]
        print(*('%4.1f ms ' % (t * 1e3) for t in ts), stmt)
    print()
Upvotes: 3
Reputation: 148965
Before further tests I would have advised using the second version, because we all know that testing a boolean is simpler than testing string equality.
Then I did what I advised @AidenEllis to do (Python 3.10 on Windows), and was kind of amazed:
timeit('x = "foo" if a == b else "bar"', '''a=True
b=False
''')
0.031938999999511
timeit('x = "foo" if a == b else "bar"', '''a=True
b=True
''')
0.032499900000402704
timeit('x = "foo" if a == b else "bar"', '''a="Somevalue"
b="Somevalue1"
''')
0.03237569999964762
Nothing really significant...
Then I tried:
timeit('x = "foo" if a else "bar"', 'a=True')
0.022047000000384287
timeit('x = "foo" if a else "bar"', 'a=False')
0.020898400000078254
Close to 30% faster, looks good...
And finally:
timeit('x = "foo" if (a or (b == c)) else "bar"', '''a=True
b="Somevalue"
c="Somevalue1"
''')
0.022851300000183983
Still a significant gain, but it means that testing the truth value of a boolean is faster than comparing two values, whatever their type, even if they are booleans. Not really what I expected...
My conclusion is that we are relying on implementation details (which is why I gave the Python version), and that the only sensible answer is that it does not really matter: the gain, if any, should be negligible compared to the real processing time.
Upvotes: 1
Reputation: 56
This might not be the quick answer you're looking for, but why not simply time both approaches separately and check which one finishes faster?
Use this if you just want to quickly compare two different pieces of code:
import time
start = time.perf_counter()
# do your processes here
finish = time.perf_counter()
total = finish - start
print(f"Process Time: {round(total * 1000, 2)}ms")
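For a slightly more robust comparison, the same idea can be sketched with the standard timeit module over about 200,000 in-memory rows. The rows list is synthetic stand-in data, the function names are hypothetical, and the flag variant follows the "is_first_line and ..." form from the benchmark in the other answer; no particular timing result is claimed.

```python
import timeit

# Synthetic stand-in for the CSV: one matching first row, then 199,999 others.
rows = [["id", "SomeValue"]] + [["id", "OtherValue"]] * 199_999

def option_string_compare():
    """String comparison on every row."""
    hits = 0
    for line in rows:
        if line[1] == "SomeValue":
            hits += 1
    return hits

def option_flag_first():
    """Boolean flag first; the string comparison runs only for the first row."""
    hits = 0
    is_first_line = True
    for line in rows:
        if is_first_line and line[1] == "SomeValue":
            hits += 1
        is_first_line = False  # this assignment has a small cost of its own
    return hits

for func in (option_string_compare, option_flag_first):
    # Best of 3 runs, each run calling the function 3 times.
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{func.__name__}: {best * 1000:.2f} ms per pass")
```

Taking the minimum of several repeats reduces noise from other processes; the numbers depend on the machine and Python version.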
Upvotes: 0