Reputation: 26195
Unambiguous DNA sequences consist only of the nucleobases adenine (A), cytosine (C), guanine (G), and thymine (T). For human consumption, the bases may be represented by the corresponding char in either uppercase or lowercase: A, C, G, T, or a, c, g, t. This representation is inefficient, however, when long sequences need to be stored. Since only four symbols need to be stored, each symbol can be assigned a 2-bit code. The commonly used .2bit format specified by UCSC does exactly that, using the following encoding: T = 0b00, C = 0b01, A = 0b10, G = 0b11.
The C code below shows a reference implementation written for clarity. Various open-source software that converts genomic sequences represented as a char sequence typically uses a 256-entry lookup table indexed by each char in the sequence (see the sketch after this paragraph). This also isolates the conversion from the internal representation of char. However, memory access is energetically expensive, even if the access is to an on-chip cache, and generic table look-ups are difficult to SIMDize. It would therefore be advantageous if the conversion could be accomplished by simple integer arithmetic. Given that ASCII is the dominant char encoding, one can restrict consideration to that.
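For reference, a minimal sketch of the conventional table-driven conversion (not taken from any particular project; the table contents follow the UCSC encoding above, with 0xff marking characters that are not valid bases):

#include <stdint.h>

static uint8_t base_lut[256];

/* Fill the 256-entry table once; every other entry stays "invalid". */
static void init_base_lut (void)
{
    for (int i = 0; i < 256; i++) base_lut[i] = 0xff;  /* invalid */
    base_lut['T'] = base_lut['t'] = 0;
    base_lut['C'] = base_lut['c'] = 1;
    base_lut['A'] = base_lut['a'] = 2;
    base_lut['G'] = base_lut['g'] = 3;
}

static uint8_t ascii_to_2bit_lut (unsigned char c)
{
    return base_lut[c];  /* one memory access per base */
}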
What are efficient computational approaches to convert nucleobases given as ASCII characters to their .2bit representation?
/* Convert nucleobases A, C, G, T represented as either uppercase or lowercase
   ASCII characters into the UCSC .2bit representation. Passing any values other
   than those corresponding to 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' results in
   an indeterminate response.
*/
unsigned int ascii_to_2bit (unsigned int a)
{
    const unsigned int UCSC_2BIT_A = 2;
    const unsigned int UCSC_2BIT_C = 1;
    const unsigned int UCSC_2BIT_G = 3;
    const unsigned int UCSC_2BIT_T = 0;

    switch (a) {
    case 'A':
    case 'a':
        return UCSC_2BIT_A;
    case 'C':
    case 'c':
        return UCSC_2BIT_C;
    case 'G':
    case 'g':
        return UCSC_2BIT_G;
    case 'T':
    case 't':
    default:
        return UCSC_2BIT_T;
    }
}
Upvotes: 5
Views: 418
Reputation: 12708
The following routine fills an array of uint32_t values with the contents of an ASCII string consisting of the characters you listed, and it preserves its state so that a second, third, etc. string can be appended to the array. Its use is illustrated with a main() routine that takes the string arguments from the command line and packs them into an array of TOTAL elements. The parameters of the routine are described in the comment above it.
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>  /* for ssize_t (POSIX) */
#define UCSC_2BIT_T (0)
#define UCSC_2BIT_C (1)
#define UCSC_2BIT_A (2)
#define UCSC_2BIT_G (3)
#define PAIRS_PER_INTEGER 16
/**
 * Converts a string of nucleobases in ASCII form into an array
 * of 32-bit integers in 2-bit form. As it should be possible to
 * continue filling the array of integers across calls, a state
 * variable is maintained (initialize it to 0) to determine where
 * in the current integer to put the next bit pair.
 *
 * @param state_ref reference to the state variable to be maintained,
 *                  holding the position at which to put the next base
 *                  in the current array entry.
 * @param dna       string with the ASCII chain of bases.
 * @param twobit    array reference to start filling.
 * @param sz        size of the array ***in array entries***.
 * @return the number of array elements completely filled, or -1 if an
 *         invalid character is encountered. This allows the caller to
 *         advance a pointer at each function call by the number of
 *         entries consumed on each call.
 */
ssize_t
dna2twobit(
    int *state_ref,
    char *dna,
    uint32_t twobit[],
    size_t sz)
{
    /* local copy so we only change the state in case of
     * successful execution */
    int state = 30 - *state_ref;
    if (state == 30) *twobit = 0;
    ssize_t total_nb = 0; /* total number of array elements filled */
    while (*dna && sz) {
        char base = toupper((unsigned char)*dna++);
        uint32_t tb;
        switch (base) {
        case 'A': tb = UCSC_2BIT_A; break;
        case 'C': tb = UCSC_2BIT_C; break;
        case 'T': tb = UCSC_2BIT_T; break;
        case 'G': tb = UCSC_2BIT_G; break;
        default: return -1;
        }
        *twobit |= tb << state;
        state -= 2;
        if (state < 0) {
            --sz; ++twobit;
            state += 32;
            total_nb++;
            if (sz) *twobit = 0; /* don't zero past the end of the array */
        }
    }
    *state_ref = 30 - state;
    return total_nb;
}
This function can be linked separately into any program you wish, but I have provided a main() routine to illustrate its use. You can call the program with the actual chains, ASCII-encoded, as command line parameters. The program encodes each one right after the previous to demonstrate multiple conversions (16 bases fit in one 32-bit integer, as stated by the definition of the format you posted in the question), so in case the number of bases encoded is not a multiple of 16, the state reflects how many bits have been filled in the last entry.
The code continues with the main() function below:
#define TOTAL 16
int main(int argc, char **argv)
{
    int state = 0, i;
    uint32_t twobit_array[TOTAL], *p = twobit_array;
    size_t twobit_cap = TOTAL;

    for (i = 1; i < argc; ++i) {
        printf("Adding the following chain: [%s]\n", argv[i]);
        ssize_t n = dna2twobit(
                &state,
                argv[i],
                p,
                twobit_cap);
        if (n < 0) {
            fprintf(stderr,
                    "argument invalid: %s\n"
                    "continuing to next\n",
                    argv[i]);
            continue;
        }
        twobit_cap -= n;
        p += n;
    }
    if (!state) --p; /* last entry is still empty, don't print it */

    uint32_t *q = twobit_array;
    size_t off = 0;
    for (int j = 0; q <= p; j++) {
        char *sep = "";
        printf("%09zu: ", off);
        for (int k = 0; k < 4 && q <= p; k++) {
            printf("%s%08x", sep, *q);
            sep = "-";
            q++;
            off += 16;
        }
        printf("\n");
    }
    return 0;
}
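For illustration, compiling the two listings together into an executable (here assumed to be named dna2twobit) and running it on a single short chain should, given the state handling above, produce output along these lines, with the first base occupying the most significant bit pair of the 32-bit word:

$ ./dna2twobit ACGT
Adding the following chain: [ACGT]
000000000: 9c000000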
Upvotes: 1
Reputation: 5040
One option is the following:
unsigned int ascii_to_2bit (unsigned int a)
{
    return ((0xad - a) & 6) >> 1;
}
This has the advantage that it needs only 8 bits, doesn't overflow, and doesn't contain non-constant shifts, so it can immediately be used to parallelize even without specific SIMD instructions, e.g. with 8 input characters packed into a 64-bit word:
uint64_t ascii_to_2bit_8bytes (uint64_t a)
{
    return ((0xadadadadadadadad - a) & 0x0606060606060606) >> 1;
}
leaving the two output bits at the bottom of each byte.
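As a possible follow-up (my own sketch, not part of the answer above), the eight per-byte codes can then be compacted into a single 16-bit value with a short scalar loop; this assumes the 8 input characters were loaded little-endian, so the first character sits in the least significant byte:

#include <stdint.h>

/* Hypothetical helper: packs the low two bits of each of the 8 bytes produced
   by ascii_to_2bit_8bytes into one 16-bit value, first character ending up in
   the least significant bit pair. */
static uint16_t pack_2bit_bytes (uint64_t codes)
{
    uint16_t packed = 0;
    for (int i = 0; i < 8; i++) {
        packed |= (uint16_t)(((codes >> (8 * i)) & 3) << (2 * i));
    }
    return packed;
}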
Upvotes: 4
Reputation: 26195
If one stares intently at the binary codes of the ASCII characters for the nucleobases, it becomes clear that bits 1 and 2 provide a unique two-bit code: A = 0b01000001 -> 0b00, C = 0b01000011 -> 0b01, G = 0b01000111 -> 0b11, T = 0b01010100 -> 0b10. The same applies to the lowercase ASCII characters, which differ merely in bit 5. Unfortunately this simple mapping does not quite match the .2bit encoding, in that the codes for A and T are swapped. One way of fixing this is with a simple four-entry permutation table stored in a variable, likely assigned to a register after optimization (an "in-register lookup table"):
unsigned int ascii_to_2bit_perm (unsigned int a)
{
    unsigned int perm = (2 << 0) | (1 << 2) | (0 << 4) | (3 << 6);
    return (perm >> (a & 6)) & 3;
}
An alternative method corrects the "naive" code generated by extracting bits 1 and 2 on the fly via simple bit manipulation, observing that bit 1 of the input is 0 for A and T, but 1 for C and G. Therefore we can swap the encodings for A and T by XOR-ing the inverse of bit 1 of the input with bit 1 of the preliminary code:
unsigned int ascii_to_2bit_twiddle (unsigned int a)
{
    return ((a >> 1) & 3) ^ (~a & 2);
}
This version is advantageous on processors with fast bitfield extraction instructions and low-end processors without a barrel shifter, as only a right shift by one bit position is required. On out-of-order processors this approach offers more instruction-level parallelism than the permutation table. It also seems easier to adapt to a SIMD implementation since the same shift count is used in all byte lanes.
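For instance, a SWAR sketch of this idea (my addition; it assumes the 8 characters have been loaded into a 64-bit word, and the per-byte masks keep the bit shifted in from a neighboring byte out of the result) could look as follows, leaving each 2-bit code at the bottom of its byte:

#include <stdint.h>

/* Process 8 ASCII bases packed into a 64-bit word; each result byte holds the
   .2bit code of the corresponding input byte in its two least significant bits. */
uint64_t ascii_to_2bit_twiddle_8 (uint64_t a)
{
    return ((a >> 1) & 0x0303030303030303ULL) ^ (~a & 0x0202020202020202ULL);
}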
Before I stared intently at the binary encoding of the relevant ASCII characters, I had looked into using a simple mathematical computation. A simple brute-force search over small multipliers and divisors yielded:
unsigned int ascii_to_2bit_math (unsigned int a)
{
    return ((18 * (a & 31)) % 41) & 3;
}
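A minimal sketch of the kind of exhaustive search involved (reconstructed from the description above, not the original search program) might look like this; it tries all small multipliers and divisors against the reference conversion from the question:

#include <stdio.h>

/* Reference conversion, same mapping as ascii_to_2bit() from the question. */
static unsigned int ref (unsigned int a)
{
    switch (a) {
    case 'A': case 'a': return 2;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 3;
    default:            return 0; /* 'T' / 't' */
    }
}

int main (void)
{
    const char bases[8] = {'A', 'C', 'G', 'T', 'a', 'c', 'g', 't'};
    for (unsigned int m = 1; m < 64; m++) {
        for (unsigned int d = 2; d < 64; d++) {
            int ok = 1;
            for (int i = 0; i < 8; i++) {
                unsigned int a = (unsigned int)bases[i];
                if ((((m * (a & 31)) % d) & 3) != ref (a)) {
                    ok = 0;
                    break;
                }
            }
            if (ok) printf ("multiplier=%u divisor=%u\n", m, d);
        }
    }
    return 0;
}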
The multiplier 18 is friendly to processors without a fast integer multiplier. Modulo computation with a compile-time constant divisor is handled efficiently by modern compilers, and no division instruction is required. Even so, I noticed that even the best available compilers have trouble taking advantage of the very limited range of inputs and outputs, so I hand-massaged the code to simplify it:
unsigned int ascii_to_2bit_math (unsigned int a)
{
    unsigned int t = 18 * (a & 31);
    return (t - ((t * 25) >> 10)) & 3;
}
Even in this form, and assuming the availability of a single-cycle multiply, this seems generally not competitive with the two prior approaches, as it requires more instructions and has a longer dependency chain. However, on 64-bit platforms the entire computation can be replaced with a 64-bit, 32-entry lookup table, which can give competitive performance if the 64-bit table can be placed into a register efficiently. This is the case for x86-64, where it is loaded as an immediate.
unsigned int ascii_to_2bit_tab (unsigned int a)
{
    uint64_t tab = ((0ULL << 0) | (2ULL << 2) | (0ULL << 4) | (1ULL << 6) |
                    (3ULL << 8) | (0ULL << 10) | (2ULL << 12) | (3ULL << 14) |
                    (1ULL << 16) | (3ULL << 18) | (0ULL << 20) | (2ULL << 22) |
                    (3ULL << 24) | (1ULL << 26) | (2ULL << 28) | (0ULL << 30) |
                    (1ULL << 32) | (3ULL << 34) | (1ULL << 36) | (2ULL << 38) |
                    (0ULL << 40) | (1ULL << 42) | (3ULL << 44) | (0ULL << 46) |
                    (2ULL << 48) | (0ULL << 50) | (1ULL << 52) | (3ULL << 54) |
                    (0ULL << 56) | (2ULL << 58) | (3ULL << 60) | (1ULL << 62));
    return (tab >> (2 * (a & 31))) & 3;
}
I am appending my test framework for reference:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define ORIGINAL_MATH (1)
unsigned int ascii_to_2bit_perm (unsigned int a)
{
    unsigned int perm = (2 << 0) | (1 << 2) | (0 << 4) | (3 << 6);
    return (perm >> (a & 6)) & 3;
}

unsigned int ascii_to_2bit_twiddle (unsigned int a)
{
    return ((a >> 1) & 3) ^ (~a & 2);
}
unsigned int ascii_to_2bit_math (unsigned int a)
{
#if ORIGINAL_MATH
    return ((18 * (a & 31)) % 41) & 3;
#else // ORIGINAL_MATH
    unsigned int t = 18 * (a & 31);
    return (t - ((t * 25) >> 10)) & 3;
#endif // ORIGINAL_MATH
}
unsigned int ascii_to_2bit_tab (unsigned int a)
{
    uint64_t tab = ((0ULL << 0) | (2ULL << 2) | (0ULL << 4) | (1ULL << 6) |
                    (3ULL << 8) | (0ULL << 10) | (2ULL << 12) | (3ULL << 14) |
                    (1ULL << 16) | (3ULL << 18) | (0ULL << 20) | (2ULL << 22) |
                    (3ULL << 24) | (1ULL << 26) | (2ULL << 28) | (0ULL << 30) |
                    (1ULL << 32) | (3ULL << 34) | (1ULL << 36) | (2ULL << 38) |
                    (0ULL << 40) | (1ULL << 42) | (3ULL << 44) | (0ULL << 46) |
                    (2ULL << 48) | (0ULL << 50) | (1ULL << 52) | (3ULL << 54) |
                    (0ULL << 56) | (2ULL << 58) | (3ULL << 60) | (1ULL << 62));
    return (tab >> (2 * (a & 31))) & 3;
}
/* Convert nucleobases A, C, G, T represented as either uppercase or lowercase
   ASCII characters into the UCSC .2bit representation. Passing any values other
   than those corresponding to 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' results in
   an indeterminate response.
*/
unsigned int ascii_to_2bit (unsigned int a)
{
    const unsigned int UCSC_2BIT_A = 2;
    const unsigned int UCSC_2BIT_C = 1;
    const unsigned int UCSC_2BIT_G = 3;
    const unsigned int UCSC_2BIT_T = 0;

    switch (a) {
    case 'A':
    case 'a':
        return UCSC_2BIT_A;
    case 'C':
    case 'c':
        return UCSC_2BIT_C;
    case 'G':
    case 'g':
        return UCSC_2BIT_G;
    case 'T':
    case 't':
    default:
        return UCSC_2BIT_T;
    }
}
int main (void)
{
    char nucleobase[8] = {'A', 'C', 'G', 'T', 'a', 'c', 'g', 't'};

    printf ("Testing permutation variant:\n");
    for (unsigned int i = 0; i < sizeof nucleobase; i++) {
        unsigned int ref = ascii_to_2bit (nucleobase[i]);
        unsigned int res = ascii_to_2bit_perm (nucleobase[i]);
        printf ("i=%2u %c res=%u ref=%u %c\n",
                i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
    }
    printf ("Testing bit-twiddling variant:\n");
    for (unsigned int i = 0; i < sizeof nucleobase; i++) {
        unsigned int ref = ascii_to_2bit (nucleobase[i]);
        unsigned int res = ascii_to_2bit_twiddle (nucleobase[i]);
        printf ("i=%2u %c res=%u ref=%u %c\n",
                i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
    }
    printf ("Testing math-based variant:\n");
    for (unsigned int i = 0; i < sizeof nucleobase; i++) {
        unsigned int ref = ascii_to_2bit (nucleobase[i]);
        unsigned int res = ascii_to_2bit_math (nucleobase[i]);
        printf ("i=%2u %c res=%u ref=%u %c\n",
                i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
    }
    printf ("Testing table-based variant:\n");
    for (unsigned int i = 0; i < sizeof nucleobase; i++) {
        unsigned int ref = ascii_to_2bit (nucleobase[i]);
        unsigned int res = ascii_to_2bit_tab (nucleobase[i]);
        printf ("i=%2u %c res=%u ref=%u %c\n",
                i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
    }
    return EXIT_SUCCESS;
}
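For reference, the first block of output expected from this framework (all variants should agree with the reference conversion, the trailing 'p' marking a pass) looks like this; the remaining three blocks are identical apart from the heading:

Testing permutation variant:
i= 0 A res=2 ref=2 p
i= 1 C res=1 ref=1 p
i= 2 G res=3 ref=3 p
i= 3 T res=0 ref=0 p
i= 4 a res=2 ref=2 p
i= 5 c res=1 ref=1 p
i= 6 g res=3 ref=3 p
i= 7 t res=0 ref=0 p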
Upvotes: 6