Algorithm to collapse forward and reverse complement of a DNA sequence in python?

Question

I have a collection of fasta files containing many sequence fragments of DNA. I’m attempting to count the total occurrences of k-mers that can be found in each file. The nice part of counting k-mers is that a single array can be created of size 4**k, where k is the size of of the k-mer being used. The sequence files that I’m processing are short read sequences from new generation sequencing machines and so assuming that the reads are all from the 5’ -> 3’ end can’t be done. The best way to solve that issue would be to map all k-mers observed to a single index for the forward and reverse complement sequences.   

Desired mapping:

with k = 2 & starting index for the array is 0 

string = ‘aa’; maps to index -> 0 

string = ‘tt’; maps to index -> 0 

string = ‘at’; maps to index -> 1  

By hand I was able to figure out that the array for all mers with the collapse of forward and reverse complements would be of length 10 and the specific indices would represent the following mers:   AA, AT, AC, AG, TA, TC, TG, CC, CG, GC  

I’m having trouble thinking of a generalized algorithm to know the number of possible mers for larger sizes of k. How many cells should be allocated in the count array?

  In my existing code I’ve come up with these three functions to process the fragments, generate the reverse complement, and map the mer (or reverse complement) to an index.

This first function will take the mer string and return the index which relates to the mer in the 4**k size array.

def mer_index_finder(my_string, mer_size):
    # my_string = my_string.lower()
    char_value = {}
    char_value["a"] = 0
    char_value["t"] = 1
    char_value["c"] = 2
    char_value["g"] = 3
    i = 0
    j = 0
    base_four_string = ""

    while(i < mer_size):
        base_four_string += str(char_value[my_string[i]])
        i += 1

    index = int(base_four_string, 4)

    return index

This function processes all of the DNA fragments and maps the counts to an index in the array of size 4**k

def get_mer_count(mer_size, file_fragments, slidingSize):
    mer_counts = {}
    for fragment in file_fragments:
        j = 0
        max_j = len(fragment) - mer_size
        while( j < max_j):
            mer_frag = fragment[j:j+mer_size]
            mer_frag = mer_frag.lower()
            if( "n" not in mer_frag):
                try:
                    mer_counts[mer_frag] += 1
                except:
                    mer_counts[mer_frag] = 1
            j += slidingSize

    myNSV = [0] * (4**mer_size)
    for mer in mer_counts.keys():
        mer_index = mer_index_finder(mer, mer_size)
        # examples showing how to collapse, 
        # without shrinking the array
        # rev_mer = make_complment_mer(mer)
        # print rev_mer
        # rev_index = mer_index_finder(rev_mer, mer_size)
        # min_index = min(mer_index, rev_index)
        # print mer_index,"	",rev_index,"	",min_index
        # myNSV[min_index] += mer_counts[mer]
        myNSV[mer_index] = mer_counts[mer]

    return myNSV[:]

Finally this function will take a mer and generate the reverse complement:

def make_complment_mer(mer_string):
    nu_mer = ""
    compliment_map = {"a" : "t", "c" : "g", "t" : "a", "g" : "c"}
    for base in mer_string:
        nu_mer += compliment_map[base]
    nu_mer = nu_mer[::-1]
    return nu_mer[:]

It seems like there should be an obvious way to always know how many cells the array should have when collapsing the forward and reverse complements together, and there's examples in the literature and some packages showing this has been done; however, I'm not finding where in the source code they are able to generate these calculations.

The second part to this question is how would you know if a mer is the forward or reverse complement without checking both?

Example:

(forward)

AAGATCACGG

(complement)

TTCTAGTGCC

(reverse complement)

CCGTGATCTT

In my above code I take the lower of the two indices, but it seems like there should be a way to figure this out without having to find the index for each mer twice: once forward and once as the reverse complement.

TL;DR what will the size of the array be if forward and reverse complements are mapped to the same index?

Edit: To find the array size using the answer I modified the get_mer_count() to include the following lines to create the size of the index:

array_size = (4 ** mer_size) / 2
if mer_size % 2 == 0:
    array_size += 2**(mer_size - 1)

myNSV = [0] * array_size

augurar · Accepted Answer

For each k-mer, there are two possibilities: either it has exactly one reverse complement, or it is its own reverse compliment (a "palindromic" mer). So if there are p palindromic k-mers then we know the array size should be p + (4**k - p)/2.

For k odd, there are no palindromic mers, since the middle nucleotide cannot be its own compliment. So the array should have size 4**k / 2.
For k even then k = 2*j for some j. A mer is palindromic if and only if its first half is the reverse compliment of its second half. There are 4**j possible first halves, so there are p = 4**j = 2**k palindromic k-mers. Thus using our formula above the array should have size p + (4**k - p)/2 = 2**k + (4**k - 2**k)/2.

Algorithm to collapse forward and reverse complement of a DNA sequence in python?

Answers (1)

Related Questions