Reputation:
I have to compute data from a large file. The file has around 100,000 rows and 3 columns. The program below works fine with a small test file, but with the large file it takes ages to display even one result. Any suggestions to speed up loading and computing the large data file?
Code (the computation is correct with the small test file; the input format is given below):
from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)
pairper = defaultdict(float)
#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
    with open('output.txt', 'w') as o:
        numline = 0
        for line in f:
            numline += 1
            line = line.split()
            pair = line[0], line[1]
            paircount[pair] += 1
            pairtime[pair] += float(line[2])
            pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())
        for pair, c in paircount.iteritems():
            #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
            o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, pairper[pair], pairtime[pair]))
Input file:
5372 2684 460.0
1885 1158 351.0
1349 1174 6375.0
1980 1174 650.0
1980 1349 650.0
4821 2684 469.0
4821 937 459.0
2684 937 318.0
1980 606 390.0
1349 606 750.0
1174 606 750.0
Upvotes: 0
Views: 89
Reputation: 123393
The primary cause of the slowness is that you recreate the pairper dictionary for every line from the paircount dictionary, which grows larger and larger. That isn't necessary, because only the value computed after all the lines have been processed is ever used.
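To get a feel for the cost: rebuilding a dictionary that already holds i entries on line i does roughly 1 + 2 + ... + n insertions over n lines, on the order of n²/2, versus n insertions when the dictionary is built once at the end. Here is a standalone timing sketch (not your data or filenames) that contrasts the two patterns:

import time
from collections import defaultdict

def rebuild_per_line(n):
    # mimics the question: the percentage dict is rebuilt from the growing counts on every line
    counts = defaultdict(int)
    for i in xrange(n):
        counts[i] += 1
        per = dict((k, v * 100.0 / (i + 1)) for k, v in counts.iteritems())
    return per

def rebuild_once(n):
    # builds the percentage dict a single time, after all lines are counted
    counts = defaultdict(int)
    for i in xrange(n):
        counts[i] += 1
    return dict((k, v * 100.0 / n) for k, v in counts.iteritems())

for func in (rebuild_per_line, rebuild_once):
    start = time.time()
    func(20000)
    print func.__name__, time.time() - start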
I don't fully understand what all the computations are, but here's something equivalent that should run much faster because it creates the pairper dictionary only once. I also simplified the logic a bit; that probably doesn't affect the run time much either way, but I think it's easier to understand.
from collections import defaultdict

paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('easy_input.txt', 'r') as f, open('easy_output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])

    pairper = dict((pair, c * 100.0 / numline) for (pair, c)
                       in paircount.iteritems())

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c,
                                          pairper[pair], pairtime[pair]))

print 'done'
Upvotes: 1
Reputation: 77337
The pairper calculation is killing you and is not needed. You can use enumerate to count the input lines and just use that value at the end. This is similar to martineau's answer, except that it doesn't pull the entire input list into memory (a bad idea) or even calculate pairper at all.
from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)
#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
    with open('output.txt', 'w') as o:
        for numline, line in enumerate(f, 1):
            line = line.split()
            pair = line[0], line[1]
            paircount[pair] += 1
            pairtime[pair] += float(line[2])
        for pair, c in paircount.iteritems():
            #print pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]
            o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]))
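Side note: dict.iteritems() only exists on Python 2. If this ever has to run on Python 3, the same approach works with items(); a minimal, otherwise-unchanged sketch:

from collections import defaultdict

paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('input.txt') as f, open('output.txt', 'w') as o:
    for numline, line in enumerate(f, 1):
        fields = line.split()
        pair = fields[0], fields[1]
        paircount[pair] += 1
        pairtime[pair] += float(fields[2])
    for pair, c in paircount.items():
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]))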
Upvotes: 1