Reputation: 5739
I am looking for a performance improvement on a taks that I am doing.
The task is quite simple: from .txt file to a SQL database.
So the txt file consists in a bunch of lines looking like this one:
200101 35.922 2.127 1.182 1.182 1.418 1.654
Explanation:
200111
: is info, consists in 20
(channel num) 01
(page num) 11
(code)
the rest of the double values are just values: I1, I2...up to I6
Thus, the SQL file will have as a columns [channel, page, code, I1, I2, I3, I4, I5, I6, passed]
The problem is that, on the txt file, code
can be 00, 11, 10, 01, or 22 and, depending on the code, I need to perform one action or another with the values of I to decide passed=1 or passed=0. For example, in this case, if code=11
, passed=1 if I1>I3 and I6<1
The lines on the txt are sorted by code.
So, with that explanation, what I am basically doing is something like that:
with open(txtFile, 'r') as txt:
for line in txt:
currentLine = line.split(' ')[0]
if currentLine.endswith('00'):
#do some actions here
if currentLine.endswith('01'):
#do some actions here
#...
#and so on
# and of course write to SQL file
So, is it anything better or more time-efficinet than checking each line with an if
clause
Upvotes: 0
Views: 96
Reputation: 481
You might get some very slight improvement only doing a split once:
currentLine = line.split(' ', 1)[0]
Or if the first field you're interested in always has the same length (6 using your example), you could try to fetch only those characters:
currentLine = line[:6]
If the length of the first field is variable you could try this:
currentLine = line[:line.index(' ')]
Here's some timings to see which is faster...
Your current method:
# python3 -m timeit -s "l = '200101 35.922 2.127 1.182 1.182 1.418 1.654'" "lineCode = l.split(' ')[0]"
1000000 loops, best of 3: 0.61 usec per loop
First suggestion (limit split to one occurrence):
# python3 -m timeit -s "l = '200101 35.922 2.127 1.182 1.182 1.418 1.654'" "lineCode = l.split(' ', 1)[0]"
1000000 loops, best of 3: 0.237 usec per loop
Second suggestion (use slice to get fixed length field):
# python3 -m timeit -s "l = '200101 35.922 2.127 1.182 1.182 1.418 1.654'" "currentLine = l[:6]"
10000000 loops, best of 3: 0.0708 usec per loop
Third suggestion (use slice + index to get variable length field):
# python3 -m timeit -s "l = '200101 35.922 2.127 1.182 1.182 1.418 1.654'" "currentLine = l[:l.index(' ')]"
1000000 loops, best of 3: 0.208 usec per loop
In my elementary testing, it seems suggestion 2 is the fastest if you can manage it. The other two suggestions are very similar in performance but better than your current method by a decent margin.
Obviously these timings will vary depending on the platform you're running them on but relatively speaking the performance improvements should hold up anywhere.
Now, all that said, I agree with your other commentators that your slowness is probably coming from somewhere else. If I had to guess it would be your SQL INSERTs. The only thing I can suggest doing there is either a multiple INSERT if the database and driver allow it or writing your SQL statments to a properly formatted file and letting another tool do a bulk import (could even be called using Python subprocess module).
If you only need to test those two characters (the 5th and 6th) then this is the most efficient I found. It eliminates the inefficient split
you're using and the slower endswith
.
Yours:
# python3 -m timeit -s "l = '200101 35.922 2.127 1.182 1.182 1.418 1.654'" "currentLine = l.split(' ')[0]; currentLine.endswith('00')"
1000000 loops, best of 3: 0.72 usec per loop
Better:
# python3 -m timeit -s "l = '200101 35.922 2.127 1.182 1.182 1.418 1.654'" "currentLine = l[:6]; lineCode = currentLine[4:]; lineCode == '00'"
10000000 loops, best of 3: 0.161 usec per loop
Best:
# python3 -m timeit -s "l = '200101 35.922 2.127 1.182 1.182 1.418 1.654'" "currentLine = l[4:6]; currentLine == '00'"
10000000 loops, best of 3: 0.102 usec per loop
So, you could do this:
with open(txtFile, 'r') as txt:
for line in txt:
currentLine = line[4:6]
if currentLine == '00':
#do some actions here
elif currentLine == '01':
#do some actions here
#...
#and so on
# and of course write to SQL file
Upvotes: 1