Nemori

Reputation: 11

IO Performance of large files with Python 3, what to expect, bottlenecks?

How performant is slicing in Python 3?

I wrote a little program that processes 7.1GB of lines of words into a cut down format (infixes).

Currently I get a processing speed of about 1.3MB/sec on the file collection. I have already made some improvements using common guides, but I feel that I might be missing the real performance hog. I suspect the bottleneck is the string slicing in the inFix() function, but I haven't been able to find a faster solution so far.

Am I missing some important concept or is Python just not suited for such a task (yet)?

    def inFix(w, infixSize):
        length = len(w)
        if length > infixSize :
            surround = length - infixSize
            pre = surround // 2
            suf = surround - pre
            return w[pre:-suf]
        else: 
            return w

    def infix2g(line):
        return ("i %s %s %s\t%s\ni %s %s %s\t%s\ni %s %s %s\t%s\ni %s %s %s\t%s\ni %s %s %s\t%s\n" % (
            inFix(line[1],1), inFix(line[2],1), inFix(line[3],1), line[0],
            inFix(line[1],2), inFix(line[2],2), inFix(line[3],2), line[0],
            inFix(line[1],3), inFix(line[2],3), inFix(line[3],3), line[0],
            inFix(line[1],4), inFix(line[2],4), inFix(line[3],4), line[0],
            inFix(line[1],5), inFix(line[2],5), inFix(line[3],5), line[0]))

    uniOut = open('3-gram.txt', 'wt', encoding='iso-8859-15')
    for line in open('3-gram-1to5_infixes.txt', encoding='iso-8859-15'):
        line = line.split()
        if len(line) != 4:
            continue
        uniOut.write(infix2g(line))

    uniOut.close()

Upvotes: 1

Views: 730

Answers (1)

ncoghlan

Reputation: 41496

As per my comment, your best bet is to run a profiler over the script with a smaller data set to see where the time is going in relative terms.
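
For example, a minimal profiling sketch with the standard library's cProfile and pstats might look like the following. `process_sample()`, `profile_sample()` and the sample line are made up for illustration; they only stand in for the real loop over the file, and `inFix()` is copied from the question.

```python
import cProfile
import io
import pstats


def inFix(w, infixSize):
    # Same logic as in the question: keep the middle infixSize characters.
    length = len(w)
    if length > infixSize:
        surround = length - infixSize
        pre = surround // 2
        suf = surround - pre
        return w[pre:-suf]
    return w


def process_sample(lines):
    # Hypothetical stand-in for the real loop: split each line, cut infixes.
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        for size in range(1, 6):
            out.append(" ".join(inFix(word, size) for word in fields[1:]))
    return out


def profile_sample(lines, top=10):
    # Run process_sample under cProfile; report the top entries by cumulative time.
    profiler = cProfile.Profile()
    profiler.enable()
    process_sample(lines)
    profiler.disable()
    report = io.StringIO()
    pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(top)
    return report.getvalue()


if __name__ == "__main__":
    sample = ["42 alpha bravo charlie"] * 20000  # made-up sample data
    print(profile_sample(sample))
```

For a whole script, running `python -m cProfile -s cumulative yourscript.py` on a small input gives the same kind of report without touching the code.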

However, there are a couple of easy optimisations you may look into as possible low hanging fruit.

Firstly, move the main processing loop inside a function in order to take advantage of optimised local variable access. This can make a surprisingly big difference.
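
A rough sketch of what that could look like is below. The `main()` name, its parameters and the `write` alias are my own choices rather than anything from your script, and `infix2g()` here is a join-based variant with the same output format as yours.

```python
def inFix(w, infixSize):
    # Same logic as in the question: keep the middle infixSize characters.
    length = len(w)
    if length > infixSize:
        surround = length - infixSize
        pre = surround // 2
        suf = surround - pre
        return w[pre:-suf]
    return w


def infix2g(fields):
    # One output line per infix size 1..5, in the question's format.
    fragments = []
    for size in range(1, 6):
        fragments.extend([
            'i ', inFix(fields[1], size),
            ' ', inFix(fields[2], size),
            ' ', inFix(fields[3], size),
            '\t', fields[0],
            '\n'])
    return ''.join(fragments)


def main(in_path, out_path, encoding='iso-8859-15'):
    # Inside a function, src, dst, write and fields are local variables,
    # which CPython accesses faster than module-level globals.
    with open(in_path, encoding=encoding) as src, \
            open(out_path, 'wt', encoding=encoding) as dst:
        write = dst.write  # look up the bound method once, not per line
        for line in src:
            fields = line.split()
            if len(fields) != 4:
                continue
            write(infix2g(fields))


if __name__ == '__main__':
    # File names as they appear in the question.
    main('3-gram-1to5_infixes.txt', '3-gram.txt')
```

The `with` block also closes both files if an exception is raised part-way through, which the explicit `close()` call does not.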

Secondly, I would avoid the string formatting call and use a string join instead:

    def infix2g(line):
        fragments = []
        # Infix sizes 1 to 5, matching the original format string
        for i in range(1, 6):
            fragments.extend([
                'i ', inFix(line[1], i),
                ' ', inFix(line[2], i),
                ' ', inFix(line[3], i),
                '\t', line[0],
                '\n'])
        return ''.join(fragments)
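
Whether the join actually wins on your interpreter is worth measuring rather than assuming. A small timeit sketch along these lines would tell you. The function names and the sample record are invented for the comparison, and the `%` version formats one output line per size rather than using your single large format string.

```python
import timeit


def inFix(w, infixSize):
    # Same logic as in the question: keep the middle infixSize characters.
    length = len(w)
    if length > infixSize:
        surround = length - infixSize
        pre = surround // 2
        suf = surround - pre
        return w[pre:-suf]
    return w


def infix2g_format(line):
    # %-formatting, one output line per infix size 1..5.
    return ''.join(
        "i %s %s %s\t%s\n" % (inFix(line[1], i), inFix(line[2], i),
                              inFix(line[3], i), line[0])
        for i in range(1, 6))


def infix2g_join(line):
    # Pure join over a flat list of fragments.
    fragments = []
    for i in range(1, 6):
        fragments.extend([
            'i ', inFix(line[1], i),
            ' ', inFix(line[2], i),
            ' ', inFix(line[3], i),
            '\t', line[0],
            '\n'])
    return ''.join(fragments)


def compare(number=20000):
    # Time both variants on one made-up record; returns seconds per variant.
    fields = ['42', 'alpha', 'bravo', 'charlie']
    assert infix2g_format(fields) == infix2g_join(fields)
    return {
        'format': timeit.timeit(lambda: infix2g_format(fields), number=number),
        'join': timeit.timeit(lambda: infix2g_join(fields), number=number),
    }


if __name__ == '__main__':
    for name, seconds in compare().items():
        print(name, round(seconds, 4))
```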

Upvotes: 1
