Reputation: 263

Finding the index of a set of characters in huge sequence of characters

Let's say I have an extremely large sequence of characters of A-D, 4 Billion to be exact. My goal is to find the indexes of several new sequences of letters that are set at length 30 within that large sequence of characters. The problem also increases in difficulty when the sequence you are looking for has a small error (a letter is wrong). How should I tackle this problem?

The trivial method is to iterate one letter at a time across the whole 4 Billion text file, but that'll take forever with memory running out.

I've been told to utilize a hashmap, but I'm not sure exactly what to use as my key value pair. The idea of using regex has also come up, but I'm not entirely sure if it'll work with my problem. Any help in terms of direction would be appreciated. Thanks!

Here's an illustration of what I'm asking:

Upvotes: 4

Answers (3)

Matt

Reputation: 11815

Here's a quick and easy bit of code to deal with the representation.

public static enum Nucleotide { 
    A,B,C,D;
}

public static int setbit(int val, int pos, boolean on) {
    if (on) {
                    // set bit
        return val | (1 << (8-pos-1));
    }
    else {
                    // unset bit
        return val & ~(1 << (8-pos-1));         
    }
}

public static int set2bits(int val, int pos, int bits) {
            // set/unset the first bit 
    val = setbit(val, pos, (bits & 2) > 0);
            // set/unset the second bit
    val = setbit(val, pos+1, (bits & 1) > 0);

    return val;
}

public static int setNucleotide(int sequence, int pos, Nucleotide tide) {
            // set both bits based on the ordinal position in the enum
    return set2bits(sequence, pos*2, tide.ordinal());
}

public static void setNucleotide(int [] sequence, int pos, Nucleotide tide) {
            // figure out which element in the array to work with
    int intpos = pos/4;
            // figure out which of the 4 bit pairs to work with.
    int bitpos = pos%4;
    sequence[intpos] = setNucleotide(sequence[intpos], bitpos, tide);       
}

public static Nucleotide getNucleotide(int [] sequence, int pos) {
    int intpos = pos/4;
    int bitpos = pos%4;
    int val = sequence[intpos];
            // get the bits for the requested on, and shift them
            // down into the least significant bits so we can
            // convert batch to the enum.
    int shift = (8-(bitpos+1)*2);       
    int tide = (val & (3 << shift)) >> shift;
    return Nucleotide.values()[tide];

}

public static void main(String args[]) {
    int sequence[] = new int[4];
    setNucleotide(sequence, 4, Nucleotide.C);
    System.out.println(getNucleotide(sequence, 4));
}

Obviously there's a lot of bitshifting going on, but the small # of comments should make sense as to what's going on.

Of course the downside of this representation is that you are working in groups of 4. If you want say 10 nucleotides, you'll have to keep another variable somewhere with the count so that you know the last 2 nucleotides in the sequence aren't useful.

The fuzzy matching can be done with brute force if nothing else. You would take a sequence of N nucleotides in, then, starting at 0, check against nucleotides 0:N-1 and see how many match. Then you go from 1:N then 2:N+1, etc...

Upvotes: 1

Marko Topolnik

Reputation: 200296

By encoding in characters you are wasting 14 bits on every 2 you use. You could encode four nucleotide letters in just a single byte, then you'd need only half a gigabyte. As for an algorithm, you can study the code in java.lang.String.indexOf and the Wikipedia page on Boyer-Moore algorithm.

BTW you could make the search instantaneous if you employed the Lucene index for this. The idea would be to index every 30-letter subsequence as a separate document in Lucene. As for error tolerance, you'd need to use N-grams, or make a fuzzy search (in Lucene 4 there's a new algorithm to quickly locate strings with edit distance up to 2 or 3).

Upvotes: 3

linuxuser27

Reputation: 7353

This is a classic problem call the longest common subsequence (LCS). There are many algorithms to solve it. The genome project does this kind of search a lot. The wiki link provided has many examples. Your threshold for errors would be a special case.

Are you doing something with gene sequencing? I ask only because you mention only 4 variables :)

Upvotes: 4

Finding the index of a set of characters in huge sequence of characters

Answers (3)

Related Questions