codeKiller
codeKiller

Reputation: 5739

Performance improvement when reading txt file

I am looking for a performance improvement on a taks that I am doing.

The task is quite simple: from .txt file to a SQL database.

So the txt file consists in a bunch of lines looking like this one:

200101 35.922 2.127 1.182 1.182 1.418 1.654

Explanation:

200111: is info, consists in 20 (channel num) 01 (page num) 11 (code)

the rest of the double values are just values: I1, I2...up to I6

Thus, the SQL file will have as a columns [channel, page, code, I1, I2, I3, I4, I5, I6, passed]

The problem is that, on the txt file, code can be 00, 11, 10, 01, or 22 and, depending on the code, I need to perform one action or another with the values of I to decide passed=1 or passed=0. For example, in this case, if code=11, passed=1 if I1>I3 and I6<1

The lines on the txt are sorted by code.

So, with that explanation, what I am basically doing is something like that:

with open(txtFile, 'r') as txt: 
    for line in txt:
        currentLine = line.split(' ')[0]
        if currentLine.endswith('00'):
            #do some actions here
        if currentLine.endswith('01'):
            #do some actions here
        #...
        #and so on
        # and of course write to SQL file

So, is it anything better or more time-efficinet than checking each line with an ifclause

Upvotes: 0

Views: 96

Answers (1)

marklap
marklap

Reputation: 481

You might get some very slight improvement only doing a split once:

currentLine = line.split(' ', 1)[0]

Or if the first field you're interested in always has the same length (6 using your example), you could try to fetch only those characters:

currentLine = line[:6]

If the length of the first field is variable you could try this:

currentLine = line[:line.index(' ')]

Here's some timings to see which is faster...

Your current method:

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "lineCode = l.split(' ')[0]"
1000000 loops, best of 3: 0.61 usec per loop

First suggestion (limit split to one occurrence):

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "lineCode = l.split(' ', 1)[0]"
1000000 loops, best of 3: 0.237 usec per loop

Second suggestion (use slice to get fixed length field):

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l[:6]"                                                                                             
10000000 loops, best of 3: 0.0708 usec per loop

Third suggestion (use slice + index to get variable length field):

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l[:l.index(' ')]"
1000000 loops, best of 3: 0.208 usec per loop

In my elementary testing, it seems suggestion 2 is the fastest if you can manage it. The other two suggestions are very similar in performance but better than your current method by a decent margin.

Obviously these timings will vary depending on the platform you're running them on but relatively speaking the performance improvements should hold up anywhere.

Now, all that said, I agree with your other commentators that your slowness is probably coming from somewhere else. If I had to guess it would be your SQL INSERTs. The only thing I can suggest doing there is either a multiple INSERT if the database and driver allow it or writing your SQL statments to a properly formatted file and letting another tool do a bulk import (could even be called using Python subprocess module).


Additional Thoughts

If you only need to test those two characters (the 5th and 6th) then this is the most efficient I found. It eliminates the inefficient split you're using and the slower endswith.

Yours:

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l.split(' ')[0]; currentLine.endswith('00')"                                                       
1000000 loops, best of 3: 0.72 usec per loop

Better:

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l[:6]; lineCode = currentLine[4:]; lineCode == '00'"
10000000 loops, best of 3: 0.161 usec per loop

Best:

# python3 -m timeit -s "l = '200101   35.922    2.127    1.182    1.182    1.418    1.654'" "currentLine = l[4:6]; currentLine == '00'"                                                                       
10000000 loops, best of 3: 0.102 usec per loop

So, you could do this:

with open(txtFile, 'r') as txt: 
for line in txt:
    currentLine = line[4:6]
    if currentLine == '00':
        #do some actions here
    elif currentLine == '01':
        #do some actions here
    #...
    #and so on
    # and of course write to SQL file

Upvotes: 1

Related Questions