Nabeel Ahmed
Nabeel Ahmed

Reputation: 23

Optimal substitute data structure to improve running time due to huge size of dictionary?

I have a python script where I initialize a dictionary containing around 4.9 million keys. Eack key has a list of 24 elements which I initialize to zero. I need to parse a text file containing around 9.7 million lines (20 columns each) and based on a specific match to the key of the dictionary, I increment the appropriate list integer of the key.

The problem is that the parsing is very slow and my job is getting killed(max 24 hr walltime on a cluster). The size of the dictionary to be initialized is around 200 Mb and after putting some time checks, I found that it takes around 16 minutes to parse 10,000 lines and hence it will take approx 242 hours to parse the entire 9.7 million lines

In short, I just need to count and increment the appropriate value of a dictionary key. Is there a substitute data structure for the python dictionary which can optimize this script and make it run in a reasonable amount of time?

def count_dict_init(file):
    gff_file = open(file, 'r')
    pos_list = []
    for line in gff_file:
       line_list = line.strip().split('\t')
       if line.startswith('chr') and line[0:5] != 'chrmt':
          if line_list[2] == 'CDS':
            leftpos = int(line_list[3])
            rightpos = int(line_list[4])
            for position in range(leftpos - 100, rightpos + 101):
                pos_list.append(position)

    uniq_list = set(pos_list)
    sorted_list = list(uniq_list)
    sorted_list.sort()
    pos_dict = {}
    for pos in sorted_list:
        pos_dict[pos] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, '', '']

    print 'Size of count dicitonary is ', sys.getsizeof(pos_dict)    
    return pos_dict

def sam_parser(sam_file, count):
   dict_count = count
   parsed_file = open('Sam_parsed_dict.tab', 'w')
   non_cds_file = open('Non_Cds_file', 'w')
   for line in sam_file:
       if line[0] != '@':
          fields = line.split('\t')
          if len(fields) > 19:
              multi_flag = fields[19].strip()  
              # If the read has more than one alignment then report it as multiple mapping
              if multi_flag != 'NH:i:1':
                  multi_align = 'Y'
              else:
                  multi_align = 'N'
          else:
              multi_align = 'N'
        non_cds = False
        sam_flag = int(fields[1])
        chr_num = fields[2]
        read_length = len(fields[9])
        pos_in_value = (read_length - 27) * 2  #Determines which list position to update
        if 27 <= read_length <= 37:
           if sam_flag == 0:  # Primary alignment on forward strand
               five_prime = int(fields[3])
               if five_prime in dict_count.keys():
                   dict_count[five_prime][pos_in_value] += 1
                   aligner_cis = dict_count[five_prime][22]
                   if aligner_cis == 'Y':
                       continue
                   else:
                       dict_count[five_prime][22] = multi_align
               else:
                  non_cds = True
           if sam_flag == 16:  # On reverse strand
               five_prime = int(fields[3]) + read_length - 1
               if five_prime in dict_count.keys():
                   dict_count[five_prime][pos_in_value + 1] += 1
                   aligner_trans = dict_count[five_prime][23]
                   if aligner_trans == 'Y':
                       continue
                   else:
                       dict_count[five_prime][23] = multi_align
               else:
                  non_cds = True
           if sam_flag == 256:  # Not primary alignment
               five_prime = int(fields[3])
               if five_prime in dict_count.keys():
                  aligner_cis = dict_count[five_prime][22]
                  if aligner_cis == 'Y':
                     continue
                  else:
                     dict_count[five_prime][22] = multi_align
               else:
                  non_cds = True
           if sam_flag == 272:  # Not primary alignment and on reverse strand
               five_prime = int(fields[3]) + read_length - 1
               if five_prime in dict_count.keys():
                  aligner_trans = dict_count[five_prime][23]
                  if aligner_trans == 'Y':
                    continue
                  else:
                    dict_count[five_prime][23] = multi_align
               else:
                  non_cds = True
           if non_cds:
               non_cds_file.write(str(chr_num)+'\t'+str(fields[3])+'\n')

   for pos, counts in dict_count.iteritems():
        parsed_file.write(str(pos)+'\t'+'\t'.join(map(str, counts))+'\n')

   parsed_file.close()
   non_cds_file.close()

if __name__ == "__main__":
   # Parse arguments from commandline
   arguments = parse_arguments()
   GFF = arguments.gfffile
   chrnum = arguments.chrnum
   initial_count_dict = count_dict_init(GFF)
   SAM = open(arguments.inputPath)
   sam_parser(SAM, initial_count_dict)

Upvotes: 1

Views: 109

Answers (1)

Gabe
Gabe

Reputation: 86768

I think your problem is this expression: if five_prime in dict_count.keys():

This creates a new list of every key in your dictionary (4.9M) and then walks through it linearly until the key is found (or it goes through the whole list if the key isn't found).

Since looking up a key in a dictionary takes 1 operation and looking it up in the list is 4.9M operations, you want to use this instead: if five_prime in dict_count:.

Another thing is that you are going the lookups several times more than you need to. If doing the lookup in the dictionary is in any way a bottleneck, you can minimize it by doing the lookup only once per iteration. Here's some sample code:

           five_prime = int(fields[3])
           record = dict_count.get(five_prime)
           if record is not None:
               record[pos_in_value] += 1
               aligner_cis = record[22]
               if aligner_cis == 'Y':
                   continue
               else:
                   record[22] = multi_align
           else:
              non_cds = True

Upvotes: 12

Related Questions