Reputation: 11
How performant is slicing in Python 3?
I wrote a little program that processes 7.1GB of lines of words into a cut-down format (infixes).
Currently I get a processing speed of about 1.3MB/sec on the file collection. I have already made some improvements using common guides, but I feel that I might be missing the real performance hog. I suspect the bottleneck is in the inFix() function's string slicing, but I haven't been able to find a faster solution so far.
Am I missing some important concept or is Python just not suited for such a task (yet)?
def inFix(w, infixSize):
    length = len(w)
    if length > infixSize:
        surround = length - infixSize
        pre = surround // 2
        suf = surround - pre
        return w[pre:-suf]
    else:
        return w

def infix2g(line):
    return ("i %s %s %s\t%s\ni %s %s %s\t%s\ni %s %s %s\t%s\ni %s %s %s\t%s\ni %s %s %s\t%s\n" % (
        inFix(line[1],1), inFix(line[2],1), inFix(line[3],1), line[0],
        inFix(line[1],2), inFix(line[2],2), inFix(line[3],2), line[0],
        inFix(line[1],3), inFix(line[2],3), inFix(line[3],3), line[0],
        inFix(line[1],4), inFix(line[2],4), inFix(line[3],4), line[0],
        inFix(line[1],5), inFix(line[2],5), inFix(line[3],5), line[0]))

uniOut = open('3-gram.txt', 'wt', encoding='iso-8859-15')
for line in open('3-gram-1to5_infixes.txt', encoding='iso-8859-15'):
    line = line.split()
    if len(line) != 4:
        continue
    uniOut.write(infix2g(line))
uniOut.close()
Upvotes: 1
Views: 730
Reputation: 41496
As per my comment, your best bet is to run a profiler over the script with a smaller data set to see where the time is going in relative terms.
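As a rough sketch of what that profiling run could look like, the snippet below uses the standard library's cProfile and pstats modules. The process() function and the sample data are made-up stand-ins for the real loop and a small slice of the real file; only inFix() is taken from the question.

```python
import cProfile
import io
import pstats


def inFix(w, infixSize):
    # Same logic as the question's inFix(): keep the middle infixSize characters.
    length = len(w)
    if length > infixSize:
        surround = length - infixSize
        pre = surround // 2
        suf = surround - pre
        return w[pre:-suf]
    return w


def process(lines):
    # Hypothetical stand-in for the real loop: split, skip malformed lines,
    # and build one size-3 infix record per line.
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        out.append(' '.join(inFix(f, 3) for f in fields[1:]))
    return out


# Made-up sample data; in practice, read a small slice of the real file.
sample = ['42 alpha bravo charlie'] * 20000

profiler = cProfile.Profile()
profiler.enable()
result = process(sample)
profiler.disable()

# Collect the ten most expensive entries, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()
print(report)
```

If inFix() really dominates, it should show up near the top of the report; if split() or the write calls dominate instead, the slicing is probably not the place to optimise.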
However, there are a couple of easy optimisations you could look into as possible low-hanging fruit.
Firstly, move the main processing loop inside a function in order to take advantage of optimised local variable access. This can make a surprisingly big difference.
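A minimal sketch of that restructuring, assuming CPython: inside a function, names such as line and write are fast local variables, whereas at module level every access goes through the global dictionary. The convert() function and its sizes parameter are hypothetical names, and in-memory StringIO objects stand in for the real files so the example runs on its own.

```python
import io


def inFix(w, infixSize):
    # Same logic as the question's inFix().
    length = len(w)
    if length > infixSize:
        surround = length - infixSize
        pre = surround // 2
        suf = surround - pre
        return w[pre:-suf]
    return w


def convert(src, dst, sizes=(1, 2, 3, 4, 5)):
    # Bind the bound method once so the loop does not repeat the attribute lookup.
    write = dst.write
    for line in src:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed lines, as in the question
        count, a, b, c = fields
        for n in sizes:
            write('i %s %s %s\t%s\n' % (inFix(a, n), inFix(b, n), inFix(c, n), count))


# Demo with in-memory "files"; the middle line is malformed and gets skipped.
src = io.StringIO('7 alpha bravo charlie\nbad line\n3 one two three\n')
dst = io.StringIO()
convert(src, dst)
output = dst.getvalue()
print(output, end='')
```

For the real data, the two file objects returned by open() (ideally inside a with statement) can be passed to convert() in place of the StringIO objects.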
Secondly, I would avoid the string formatting call and use a string join instead:
def infix2g(line):
    fragments = []
    # Infix sizes 1 to 5, matching the format string in the question
    for i in range(1, 6):
        fragments.extend([
            'i ', inFix(line[1], i),
            ' ', inFix(line[2], i),
            ' ', inFix(line[3], i),
            '\t', line[0],
            '\n'])
    return ''.join(fragments)
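Before swapping one version for the other, it is worth checking that both produce identical output and measuring whether the change helps on your machine. The sketch below does both with timeit, assuming the join loop runs over infix sizes 1 to 5 as in the question's format string. The names infix2g_format and infix2g_join and the sample record are made up for the comparison.

```python
import timeit


def inFix(w, infixSize):
    # Same logic as the question's inFix().
    length = len(w)
    if length > infixSize:
        surround = length - infixSize
        pre = surround // 2
        suf = surround - pre
        return w[pre:-suf]
    return w


def infix2g_format(line):
    # The question's version: one large %-format call.
    return ("i %s %s %s\t%s\ni %s %s %s\t%s\ni %s %s %s\t%s\n"
            "i %s %s %s\t%s\ni %s %s %s\t%s\n" % (
        inFix(line[1], 1), inFix(line[2], 1), inFix(line[3], 1), line[0],
        inFix(line[1], 2), inFix(line[2], 2), inFix(line[3], 2), line[0],
        inFix(line[1], 3), inFix(line[2], 3), inFix(line[3], 3), line[0],
        inFix(line[1], 4), inFix(line[2], 4), inFix(line[3], 4), line[0],
        inFix(line[1], 5), inFix(line[2], 5), inFix(line[3], 5), line[0]))


def infix2g_join(line):
    # The join-based version, looping over infix sizes 1 to 5.
    fragments = []
    for i in range(1, 6):
        fragments.extend([
            'i ', inFix(line[1], i),
            ' ', inFix(line[2], i),
            ' ', inFix(line[3], i),
            '\t', line[0],
            '\n'])
    return ''.join(fragments)


# Made-up sample record: count followed by three words.
sample = ['7', 'alpha', 'bravo', 'charlie']
same = infix2g_format(sample) == infix2g_join(sample)
print('identical output:', same)

# Timings depend on the machine and Python version, so measure rather than assume.
for fn in (infix2g_format, infix2g_join):
    seconds = timeit.timeit(lambda: fn(sample), number=20000)
    print(fn.__name__, round(seconds, 3))
```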
Upvotes: 1