user3964336

Reputation:

Easy way to compute Large data file python

I have to compute the data from a large file. The file has around 100,000 rows and 3 columns. The program below works great with a small test file, but when run against the large file it takes ages to display even one result. Any suggestions to speed up the loading and computing of the large data file?

Code: the computation is correct with a small test file; the input format is given below.

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)
pairper = defaultdict(float)

#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
  with open('output.txt', 'w') as o:
    numline = 0
    for line in f:
        numline += 1
        line = line.split()
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])
        pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, pairper[pair], pairtime[pair]))

Inputfile:

5372 2684 460.0
1885 1158 351.0
1349 1174 6375.0
1980 1174 650.0
1980 1349 650.0
4821 2684 469.0
4821 937  459.0
2684 937  318.0
1980 606  390.0
1349 606  750.0
1174 606  750.0

Upvotes: 0

Views: 89

Answers (2)

martineau

Reputation: 123393

The primary cause of the slowness is that you recreate the pairper dictionary for each line from the paircount dictionary, which grows larger and larger. That isn't necessary, because only the value computed after all the lines have been processed is ever used.

I don't fully understand what all the computations are, but here's something equivalent that should run much faster because it only creates the pairper dictionary once. I also simplified the logic a bit; that probably didn't affect the run time much either way, but I think it's easier to understand.

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('easy_input.txt', 'r') as f, open('easy_output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])

    pairper = dict((pair, c * 100.0 / numline) for (pair, c)
                                                in paircount.iteritems())
    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c,
                                          pairper[pair], pairtime[pair]))
print 'done'
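The effect described above can be demonstrated in isolation. A minimal sketch (Python 3 syntax, with made-up in-memory data instead of a file): both versions produce the same dictionary, but the per-line rebuild does quadratic work while the single build does it once at the end.

```python
from collections import defaultdict

def per_line_rebuild(pairs):
    # Rebuilds the percentage dict on every line -- O(n^2) overall,
    # which is what made the original program crawl on 100,000 rows.
    paircount = defaultdict(int)
    pairper = {}
    for numline, pair in enumerate(pairs, 1):
        paircount[pair] += 1
        pairper = {p: c * 100.0 / numline for p, c in paircount.items()}
    return pairper

def single_build(pairs):
    # Counts first, then builds the percentage dict exactly once.
    paircount = defaultdict(int)
    numline = 0
    for numline, pair in enumerate(pairs, 1):
        paircount[pair] += 1
    return {p: c * 100.0 / numline for p, c in paircount.items()}

pairs = [("a", "b"), ("a", "b"), ("c", "d"), ("a", "b")]
assert per_line_rebuild(pairs) == single_build(pairs)
assert single_build(pairs) == {("a", "b"): 75.0, ("c", "d"): 25.0}
```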

Upvotes: 1

tdelaney

Reputation: 77337

The pairper calculation is killing you and is not needed. You can use enumerate to count the input lines and just use that value at the end. This is similar to martineau's answer, except that it doesn't pull the entire input list into memory (a bad idea) or even calculate pairper at all.

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
  with open('output.txt', 'w') as o: 
    for numline, line in enumerate(f, 1):
        line = line.split()
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]))
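For comparison only (not part of either answer), the same single-pass tallying can be sketched in Python 3 with collections.Counter, using an inline list standing in for the file:

```python
from collections import Counter, defaultdict

rows = ["5372 2684 460.0", "1885 1158 351.0", "5372 2684 40.0"]

paircount = Counter()          # pair -> number of occurrences
pairtime = defaultdict(float)  # pair -> accumulated time
numline = 0
for numline, row in enumerate(rows, 1):
    a, b, t = row.split()
    paircount[(a, b)] += 1
    pairtime[(a, b)] += float(t)

# Percentages are computed only once, at the end, from the final count.
assert paircount[("5372", "2684")] == 2
assert pairtime[("5372", "2684")] == 500.0
assert paircount[("5372", "2684")] * 100.0 / numline == 200.0 / 3
```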

Upvotes: 1
