Use Python for summing subset of columns in csv

Question

I am trying to optimize some code to run faster. For this text file matrix:

TAG, DESC, ID1, ID2, ID3, ID4, 
1, "details", 0, 1, NA, 1, 2
2, "details", 2, 1, NA, 0, 1
3, "details", 1, NA, NA, 0, 2
...

This is a large file with ~10,000 columns and ~2M rows. What I would like to do is calculate the sum across all IDs (4 for TAG=1) and the frequency given that the maximum value is 2 (so 4/8 = 0.5), and then append these values as new columns. The NAs are missing data and effectively zeros. This code works but it is very slow:

import csv

tab_dict = csv.DictReader(open(path), delimiter=",")
tab_reader = [row for row in tab_dict]

for t in tab_reader:
    idlist = [i for i in t.keys()] 
    idlist.remove('TAG')  #exclude columns that do not contain numbers for summing
    idlist.remove('DESC')
    rowsum = 0
    for i in idlist:
        try:
            rowsum += int(t[i])  # try/except to handle "NA"s
        except ValueError:  # int("NA") raises ValueError
            pass
    t["ROWSUM"] = rowsum  # create the new columns
    t["ROWFREQ"] = float(rowsum)/ float(2*len(idlist))

Any suggestions on how to speed this up? Thanks

Peter DeGlopper · Accepted Answer

Is the point of reading the whole file into the tab_reader list so you can modify it as you go? That's something you should avoid if at all possible. Since the reader is an iterator, if your later processing doesn't require the whole file to be in memory it would be better to write the modified output line-by-line.

This version still uses a DictReader, but it relies more on built-in tools and should be faster:

ignored_keys = frozenset(('TAG', 'DESC'))   
desired_keys = [key for key in tab_dict.fieldnames if key not in ignored_keys]
frequency_divisor = float(2*len(desired_keys))
for t in tab_reader:
    rowsum = sum((int(t[key]) for key in desired_keys if t[key] != 'NA'))
    rowfreq = float(rowsum) / frequency_divisor
    t["ROWSUM"] = rowsum
    t["ROWFREQ"] = rowfreq
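
To combine the loop above with the line-by-line advice, here is a minimal sketch that streams each row straight to an output file instead of building the tab_reader list. It assumes Python 3, a non-empty input file, and that a blank follows each comma as in the sample (hence skipinitialspace=True). The function name add_row_totals and the in_path/out_path parameters are made up for illustration.

```python
import csv

IGNORED_KEYS = frozenset(("TAG", "DESC"))


def add_row_totals(in_path, out_path):
    """Stream rows from in_path to out_path, appending ROWSUM and ROWFREQ."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        # Assumption: values are written as ", value", so skip the blank
        # after each delimiter; otherwise " NA" would not compare equal to "NA".
        reader = csv.DictReader(src, skipinitialspace=True)
        desired_keys = [k for k in reader.fieldnames if k not in IGNORED_KEYS]
        # Maximum value per cell is 2, so the largest possible row sum is 2 * n.
        frequency_divisor = 2 * len(desired_keys)
        writer = csv.DictWriter(
            dst, fieldnames=list(reader.fieldnames) + ["ROWSUM", "ROWFREQ"]
        )
        writer.writeheader()
        for row in reader:
            # "NA" cells are skipped, which treats them as zeros.
            rowsum = sum(int(row[k]) for k in desired_keys if row[k] != "NA")
            row["ROWSUM"] = rowsum
            row["ROWFREQ"] = rowsum / frequency_divisor
            writer.writerow(row)  # only one row is held in memory at a time
```

Because nothing is accumulated, memory use should stay roughly constant regardless of the number of rows; whether it is also faster on your data is something to measure.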

I think you'd see marginal additional improvements from using a csv.reader instance rather than DictReader and precomputing a list of desired indices instead of keys. Or, of course, if 'TAG' and 'DESC' are always the first two:

    rowsum = sum(int(x) for x in t[2:] if x != 'NA')
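
The csv.reader idea can be sketched as follows, with the wanted column positions resolved once from the header row. This is an illustration rather than a benchmarked solution: it assumes Python 3 and the sample's ", " separators, and the generator name row_totals and its ignored parameter are hypothetical.

```python
import csv


def row_totals(in_path, ignored=("TAG", "DESC")):
    """Yield (row, rowsum, rowfreq) for each data row, using positional indices."""
    with open(in_path, newline="") as src:
        # Assumption: a blank follows each comma, as in the sample rows.
        reader = csv.reader(src, skipinitialspace=True)
        header = next(reader)
        # Resolve the wanted column positions once, outside the row loop.
        desired_indices = [i for i, name in enumerate(header) if name not in ignored]
        frequency_divisor = 2 * len(desired_indices)
        for row in reader:
            # Each row is a plain list, so lookups are by index, not by key.
            rowsum = sum(int(row[i]) for i in desired_indices if row[i] != "NA")
            yield row, rowsum, rowsum / frequency_divisor
```

Each yielded row is a list of strings, so the caller can append the two new values and pass the result to csv.writer.writerow without any dictionary overhead.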
