alexhli

Reputation: 409

Use Python for summing subset of columns in csv

I am trying to optimize some code to run faster. For this text file matrix:

TAG, DESC, ID1, ID2, ID3, ID4, 
1, "details", 0, 1, NA, 1, 2
2, "details", 2, 1, NA, 0, 1
3, "details", 1, NA, NA, 0, 2
...

This is a large file with ~10,000 columns and ~2M rows. What I would like to do is calculate the sum across all IDs (4 for TAG=1) and the frequency given that the maximum value is 2 (so 4/8 = 0.5), and then append these values as new columns. The NAs are missing data and are effectively zeros. This code works but it is very slow:

import csv

tab_dict = csv.DictReader(open(path), delimiter=",")
tab_reader = [row for row in tab_dict]

for t in tab_reader:
    idlist = [i for i in t.keys()]
    idlist.remove('TAG')  #exclude columns that do not contain numbers for summing
    idlist.remove('DESC')
    rowsum = 0
    for i in idlist:
        try: rowsum += int(t[i])  #try/except to handle "NA"s
        except (TypeError, ValueError): pass  #int("NA") raises ValueError; count it as zero
    t["ROWSUM"] = rowsum  # create the new columns
    t["ROWFREQ"] = float(rowsum)/ float(2*len(idlist))

Any suggestions on how to speed this up? Thanks

Upvotes: 0

Views: 991

Answers (2)

Peter DeGlopper

Reputation: 37319

Is the point of reading the whole file into the tab_reader list so you can modify it as you go? That's something you should avoid if at all possible. Since the reader is an iterator, if your later processing doesn't require the whole file to be in memory it would be better to write the modified output line-by-line.
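A minimal sketch of that line-by-line approach, assuming TAG and DESC are always the first two columns and every other cell is either an integer or NA. The function name add_row_stats and the in-memory sample are illustrative only and are not from the question:

```python
import csv
import io


def add_row_stats(src, dst, max_value=2):
    """Stream rows from src to dst, appending ROWSUM and ROWFREQ columns.

    Assumes the first two columns are TAG and DESC and that every other
    cell is an integer or the string NA (counted as zero).
    """
    reader = csv.reader(src, skipinitialspace=True)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["ROWSUM", "ROWFREQ"])
    for row in reader:
        values = row[2:]
        rowsum = sum(int(v) for v in values if v != "NA")
        rowfreq = rowsum / float(max_value * len(values))
        # Each row is written as soon as it is processed, so only one
        # row is held in memory at a time.
        writer.writerow(row + [rowsum, rowfreq])


# In-memory files stand in for the real input and output here.
src = io.StringIO(
    'TAG,DESC,ID1,ID2,ID3,ID4\n'
    '1,"details",0,1,NA,1\n'
    '2,"details",2,1,NA,0\n'
)
dst = io.StringIO()
add_row_stats(src, dst)
```

With real files you would pass two open file objects instead of the StringIO buffers. Nothing is accumulated in a list, so memory use should stay roughly constant regardless of the number of rows.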

This version still uses a DictReader, but it leans more on built-in tools and should be faster:

ignored_keys = frozenset(('TAG', 'DESC'))   
desired_keys = [key for key in tab_dict.fieldnames if key not in ignored_keys]
frequency_divisor = float(2*len(desired_keys))
for t in tab_reader:
    rowsum = sum((int(t[key]) for key in desired_keys if t[key] != 'NA'))
    rowfreq = float(rowsum) / frequency_divisor
    t["ROWSUM"] = rowsum
    t["ROWFREQ"] = rowfreq

I think you'd see marginal additional improvements from using a csv.reader instance rather than DictReader and precomputing a list of desired indices instead of keys. Or, of course, if 'TAG' and 'DESC' are always the first two:

   rowsum = sum((int(x) for x in t[2:] if x != 'NA'))
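For the more general case, here is a rough sketch of the csv.reader variant with precomputed indices. It assumes the header row names the columns and that TAG and DESC are the only non-numeric ones; row_sums and the sample data are hypothetical names used for illustration:

```python
import csv
import io

IGNORED = frozenset(("TAG", "DESC"))  # non-numeric columns, as in the question


def row_sums(lines):
    """Yield (rowsum, rowfreq) for each data row of an iterable of CSV lines."""
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader)
    # Work out the wanted column positions once, not once per row.
    wanted = [i for i, name in enumerate(header) if name not in IGNORED]
    divisor = float(2 * len(wanted))
    for row in reader:
        rowsum = sum(int(row[i]) for i in wanted if row[i] != "NA")
        yield rowsum, rowsum / divisor


# Small in-memory sample in the same layout as the question's file.
sample = io.StringIO(
    'TAG, DESC, ID1, ID2, ID3, ID4\n'
    '1, "details", 0, 1, NA, 1\n'
    '3, "details", 1, NA, NA, 0\n'
)
results = list(row_sums(sample))
```

Indexing a list by position avoids building a dictionary for every row, which is where I'd expect most of the saving over DictReader to come from.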

Upvotes: 2

Hans Then

Reputation: 11322

For one, you loop through the data twice: once to create tab_reader and once to sum all the values. Second, you might want to ditch your DictReader and simply loop through the file yourself. Also, using built-in functions will speed things up.

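from __future__ import division # to use float division

with open(path) as f:
    next(f)  # skip the header row
    for line in f:
        # naive split: assumes DESC never contains a comma
        ids = line.rstrip('\n').split(',')[2:]
        ids = [0 if x.strip() == 'NA' else int(x) for x in ids]
        rowsum = sum(ids)
        rowfreq = rowsum / (2 * len(ids))  # parentheses needed around the divisor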

Upvotes: 2
