JW O
JW O

Reputation: 73

Fastest way to compare chars of 'A', 'C', 'G', 'T'

I'm hoping to boost the speed of my bioinformatics algorithm, which requires comparisons of characters that are one of 'A', 'C', 'G', 'T' (e.g. computing 'A' == 'C')

Since the size of a char is 8 bits, it requires 8 comparisons of binary numbers in worst case scenario. My guess was that, by representing 'A', 'C', 'G', 'T' as a pair of binary numbers (e.g. 'A' as make_pair(false,false)), I thought I could possibly boost the speed by 3~4 times since it now only requires 2 binary comparisons at maximum.

I tried using a pair of bools, but speed actually dropped by about 30%.

What is the fastest possible way to represent the four characters and to compute equality? Memory usage isn't so much of a problem for my case.

For your information, I'm using C++11 compiler. Thank you in advance.

Upvotes: 0

Views: 184

Answers (2)

tomc
tomc

Reputation: 1207

perhaps starting over with different initial concepts... I dug up an old note from helping someone learning this stuff about a decade ago. Maybe it can give you a helpfully different perspective.

https://gist.github.com/TomConlin/6cd976151d36dd3e2a9fb34842b9c66e

Upvotes: 0

Naomi
Naomi

Reputation: 5486

The amount of bits that can be compared using a single instruction depends on your CPU architecture. A 64bit architecture means that you can run computations on a 64bit word in a single instruction, not 64 instructions. So comparing two 8bit words ('A'=='G') takes exactly 1 CPU cycle to compute.

If you want to boost your speed, you can represent your characters in 2bit words, but pack 32 words in a 64bit variable, and run the comparisons 32 words at a time, thus reducing the amount of CPU cycles by a factor of 32.

Now, if you want to compare multiple 64bit variables that are stored sequentially in an array, you can make use of memcmp to allow the sequential scan to be optimized properly.

Upvotes: 6

Related Questions