Converting nucleobase representation from ASCII to UCSC .2bit

Unambiguous DNA sequences consist only of the nucleobases adenine (A), cytosine (C), guanine (G), thymine (T). For human consumption, the bases may be represented by the corresponding char in either uppercase or lowercase: A, C, G, T, or a, c, g, t. This representation is inefficient, however, when long sequences need to be stored. Since only four symbols need to be stored, each symbol can be assigned a 2-bit code. The commonly used .2bit-format specified by UCSC does exactly that, using the following encoding: T = 0b00, C = 0b01, A = 0b10, G = 0b11.

The C code below shows a reference implementation written for clarity. Various open-source software that converts genomic sequences represented as a char sequence typically uses a 256-entry lookup table indexed by each char in sequence. This also isolates from the internal representation of char. However, memory access is energetically expensive, even if the access is to an on-chip cache, and generic table look-ups are difficult to SIMDize. It would therefore be advantageous if the conversion could be accomplished by simple integer arithmetic. Given that ASCII is the dominating char encoding, one can restrict to that.

What are efficient computational approaches to convert nucleobases given as ASCII characters to their .2bit-representation?

/* Convert nucleobases A, C, G, T represented as either uppercase or lowercase
   ASCII characters into UCSC .2bit-presentation. Passing any values other than
   those corresponding to 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' results in an
   indeterminate response.
*/
unsigned int ascii_to_2bit (unsigned int a)
{
    const unsigned int UCSC_2BIT_A = 2;
    const unsigned int UCSC_2BIT_C = 1;
    const unsigned int UCSC_2BIT_G = 3;
    const unsigned int UCSC_2BIT_T = 0;

    switch (a) {
    case 'A':
    case 'a':
        return UCSC_2BIT_A;
        break;
    case 'C':
    case 'c':
        return UCSC_2BIT_C;
        break;
    case 'G':
    case 'g':
        return UCSC_2BIT_G;
        break;
    case 'T':
    case 't':
    default:
        return UCSC_2BIT_T;
        break;
    }
}

Upvotes: 5

Answers (3)

Luis Colorado

Reputation: 12708

The following routine will fill an array of uint32_t values with the contents of an ASCII string filled with the characters you stated and will conserve the state in order to be able to append a second, third, etc. number of strings to the array. The way of using it is illustrated with a main() routine that takes the string arguments from the command line and passes them to an array of TOTAL elements. The parameters to the routine are described in the comment above it.

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define UCSC_2BIT_T  (0)
#define UCSC_2BIT_C  (1)
#define UCSC_2BIT_A  (2)
#define UCSC_2BIT_G  (3)

#define PAIRS_PER_INTEGER   16

/**
 * Converts a string of nucleobases in ASCII form into an array
 * of 32bit integers in 2bitform.  As it should be possible to
 * continue the array of integers, a status reference is
 * maintained in order to determine determine where in the
 * integer to start putting the two bit sequences.  For this,
 * a (*state) variable is maintained (initialize it to 0) to
 * remember where to start putting bitpairs in the array.
 *
 * @param state_ref reference to the state variable to be maintained
 *               with the position on which to put the base in the
 *               array entry.
 * @param dna    string with the ASCII chain of bases.
 * @param twobit array reference to start filling.
 * @param sz     size of the array ***in array entries***.
 * @return the number of array elements written.  This allows to
 *         use a pointer and advance it at each function call
 *         with the number of entries consumed on each call.
 */
ssize_t
dna2twobit(
    int *state_ref,
    char *dna,
    uint32_t twobit[],
    size_t sz)
{
    /* local copy so we only change the state in case of
     * successful execution */
    int state = 30 - *state_ref;
    if (state == 30) *twobit = 0;
    ssize_t total_nb = 0; /* total number of array elements consumed */
    while (*dna && sz) {
        char base = toupper(*dna++);
        uint32_t tb;
        switch (base) {
        case 'A': tb = UCSC_2BIT_A; break;
        case 'C': tb = UCSC_2BIT_C; break;
        case 'T': tb = UCSC_2BIT_T; break;
        case 'G': tb = UCSC_2BIT_G; break;
        default: return -1;
        }
        *twobit |= tb << state;
        state -= 2;
        if (state < 0) {
            --sz; ++twobit;
            state += 32;
            *twobit = 0;
            total_nb++;
        }
    }
    *state_ref = 30 - state;
    return total_nb;
}

This function can be linked separately into any program you wish, but I have provide a main() code to illustrate the use of the function. You can call the program with the actual chains in ASCII encoded in the command line parameters. The program will encode them one after the previous in order to demonstrate multi conversion (16 bases fit in one 32bit integer, as stated by the definition of the format you posted in the question, so in case not a multiple of 16 have been encoded, the status reflects how many bits have been covered in the last one.

The code continues with the main function below:


#define TOTAL 16

int main(int argc, char **argv)
{
    int state = 0, i;
    uint32_t twobit_array[TOTAL], *p = twobit_array;
    size_t twobit_cap = TOTAL;

    for (i = 1; i < argc; ++i) {
        printf("Adding the following chain: [%s]\n", argv[i]);
        ssize_t n = dna2twobit(
                        &state,
                        argv[i],
                        p,
                        twobit_cap);
        if (n < 0) {
            fprintf(stderr,
                    "argument invalid: %s\n"
                    "continuing to next\n",
                    argv[i]);
            continue;
        }
        twobit_cap -= n;
        p += n;
    }
    if (!state) --p;
    uint32_t *q = twobit_array;
    size_t off = 0;
    for (int j = 0; q <= p; j++) {
        char *sep = "";
        printf("%09zd: ", off);
        for (int k = 0; k < 4 && q <= p; k++) {
            printf("%s%08x", sep, *q);
            sep = "-";
            q++;
            off += 16;
        }
        printf("\n");
    }
    return 0;
}

Upvotes: 1

Falk Hüffner

Reputation: 5040

One option is the following:

unsigned int ascii_to_2bit (unsigned int a)
{
    return ((0xad - a) & 6) >> 1;
}

This has the advantage that it needs only 8 bits, doesn't overflow, and doesn't contain non-constant shifts, so it can be immediately used to parallelize even without specific SIMD instructions, e.g. with 8 characters input in a 64-bit word

unsigned int ascii_to_2bit_8bytes (uint64_t a)
{
    return ((0xadadadadadadadad - a) & 0x0606060606060606) >> 1;
}

leaving the two output bits at the bottom of each byte.

Upvotes: 4

njuffa

Reputation: 26195

If one stares at the binary codes for the ASCII characters for the nucleobases intently, it becomes clear that bits 1 and 2 provide a unique two-bit code: A = 0b01000001 -> 0b00, C = 0b01000011 -> 0b01, G = 0b01000111 -> 0b11, T = 0b01010100 -> 0b10. Analogous for the lowercase ASCII characters, which differ merely in bit 5. Unfortunately this simple mapping does not quite match the .2bit-encoding, in that the codes for A and T are swapped. One way of fixing this is with a simple four-entry permutation table stored in a variable, likely assigned to a register after optimization ("in-register lookup-table"):

unsigned int ascii_to_2bit_perm (unsigned int a)
{
    unsigned int perm = (2 << 0) | (1 << 2) | (0 << 4) | (3 << 6);
    return (perm >> (a & 6)) & 3;
}

An alternative method corrects the generated "naive" code from extracting bits 1 and 2 on the fly via simple bit manipulation, by observing that bit 1 is 0 for A and T, but 1 for C and G. Therefore we can swap the encodings for A and T by XOR-ing the inverse of bit 1 of the input with bit 1 of the preliminary code:

unsigned int ascii_to_2bit_twiddle (unsigned int a)
{
    return ((a >> 1) & 3) ^ (~a & 2);
}

This version is advantageous on processors with fast bitfield extraction instructions and low-end processors without a barrel shifter, as only a right shift by one bit position is required. On out-of-order processors this approach offers more instruction-level parallelism than the permutation table. It also seems easier to adapt to a SIMD implementation since the same shift count is used in all byte lanes.

Before I stared intently at the binary encoding of the relevant ASCII characters, I had looked into using a simple mathematical computation. A simple brute-force search over small multipliers and divisors yielded:

unsigned int ascii_to_2bit_math (unsigned int a)
{
    return ((18 * (a & 31)) % 41) & 3;
}

The multiplier 18 is friendly to processors without a fast integer multiplier. Modulo computation with a compile-time constant divisor is handled efficiently by modern compilers, and no division is required. Even so, I noticed that even the best available compilers have trouble taking advantage of the very limited range of inputs and outputs, so I had to hand-massage this to simplify the code:

unsigned int ascii_to_2bit_math (unsigned int a)
{
    unsigned int t = 18 * (a & 31);
    return (t - ((t * 25) >> 10)) & 3;
}

Even in this form and assuming the availability of a single-cycle multiply this seems generally not competitive with the two prior approaches, as it yields more instructions and a longer dependency chain. However, on 64-bit platforms this entire computation can be replaced with a 64-bit, 32-entry lookup table which can give competitive performance if this 64-bit table can be placed into a register efficiently, which is the case for x86-64, where it is loaded as a immediate.

unsigned int ascii_to_2bit_tab (unsigned int a)
{
    uint64_t tab = ((0ULL <<  0) | (2ULL <<  2) | (0ULL <<  4) | (1ULL <<  6) |
                    (3ULL <<  8) | (0ULL << 10) | (2ULL << 12) | (3ULL << 14) |
                    (1ULL << 16) | (3ULL << 18) | (0ULL << 20) | (2ULL << 22) |
                    (3ULL << 24) | (1ULL << 26) | (2ULL << 28) | (0ULL << 30) |
                    (1ULL << 32) | (3ULL << 34) | (1ULL << 36) | (2ULL << 38) |
                    (0ULL << 40) | (1ULL << 42) | (3ULL << 44) | (0ULL << 46) |
                    (2ULL << 48) | (0ULL << 50) | (1ULL << 52) | (3ULL << 54) |
                    (0ULL << 56) | (2ULL << 58) | (3ULL << 60) | (1ULL << 62));
    return (tab >> (2 * (a & 31))) & 3;
}

I am appending my test framework for reference:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define ORIGINAL_MATH (1)

unsigned int ascii_to_2bit_perm (unsigned int a)
{
    unsigned int perm = (2 << 0) | (1 << 2) | (0 << 4) | (3 << 6);
    return (perm >> (a & 6)) & 3;
}

unsigned int ascii_to_2bit_twiddle (unsigned int a)
{
    return ((a >> 1) & 3) ^ (~a & 2);
}

unsigned int ascii_to_2bit_math (unsigned int a)
{
#if ORIGINAL_MATH
    return ((18 * (a & 31)) % 41) & 3;
#else // ORIGINAL_MATH
    unsigned int t = 18 * (a & 31);
    return (t - ((t * 25) >> 10)) & 3;
#endif // ORIGINAL_MATH
}

unsigned int ascii_to_2bit_tab (unsigned int a)
{
    uint64_t tab = ((0ULL <<  0) | (2ULL <<  2) | (0ULL <<  4) | (1ULL <<  6) |
                    (3ULL <<  8) | (0ULL << 10) | (2ULL << 12) | (3ULL << 14) |
                    (1ULL << 16) | (3ULL << 18) | (0ULL << 20) | (2ULL << 22) |
                    (3ULL << 24) | (1ULL << 26) | (2ULL << 28) | (0ULL << 30) |
                    (1ULL << 32) | (3ULL << 34) | (1ULL << 36) | (2ULL << 38) |
                    (0ULL << 40) | (1ULL << 42) | (3ULL << 44) | (0ULL << 46) |
                    (2ULL << 48) | (0ULL << 50) | (1ULL << 52) | (3ULL << 54) |
                    (0ULL << 56) | (2ULL << 58) | (3ULL << 60) | (1ULL << 62));
    return (tab >> (2 * (a & 31))) & 3;
}

/* Convert nucleobases A, C, G, T represented as either uppercase or lowercase
   ASCII characters into UCSC .2bit-presentation. Passing any values other than
   those corresponding to 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' results in an
   inderminate response.
*/
unsigned int ascii_to_2bit (unsigned int a)
{
    const unsigned int UCSC_2BIT_A = 2;
    const unsigned int UCSC_2BIT_C = 1;
    const unsigned int UCSC_2BIT_G = 3;
    const unsigned int UCSC_2BIT_T = 0;

    switch (a) {
    case 'A':
    case 'a':
        return UCSC_2BIT_A;
        break;
    case 'C':
    case 'c':
        return UCSC_2BIT_C;
        break;
    case 'G':
    case 'g':
        return UCSC_2BIT_G;
        break;
    case 'T':
    case 't':
    default:
        return UCSC_2BIT_T;
        break;
    }
}

int main (void)
{
    char nucleobase[8] = {'A', 'C', 'G', 'T', 'a', 'c', 'g', 't'};
    printf ("Testing permutation variant:\n");
    for (unsigned int i = 0; i < sizeof nucleobase; i++) {
        unsigned int ref = ascii_to_2bit (nucleobase[i]);
        unsigned int res = ascii_to_2bit_perm (nucleobase[i]);
        printf ("i=%2u %c res=%u ref=%u %c\n", 
                i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
    }
    printf ("Testing bit-twiddling variant:\n");
    for (unsigned int i = 0; i < sizeof nucleobase; i++) {
        unsigned int ref = ascii_to_2bit (nucleobase[i]);
        unsigned int res = ascii_to_2bit_twiddle (nucleobase[i]);
        printf ("i=%2u %c res=%u ref=%u %c\n", 
                i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
    }
    printf ("Testing math-based variant:\n");
    for (unsigned int i = 0; i < sizeof nucleobase; i++) {
        unsigned int ref = ascii_to_2bit (nucleobase[i]);
        unsigned int res = ascii_to_2bit_math (nucleobase[i]);
        printf ("i=%2u %c res=%u ref=%u %c\n", 
                i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
    }
    printf ("Testing table-based variant:\n");
    for (unsigned int i = 0; i < sizeof nucleobase; i++) {
        unsigned int ref = ascii_to_2bit (nucleobase[i]);
        unsigned int res = ascii_to_2bit_tab (nucleobase[i]);
        printf ("i=%2u %c res=%u ref=%u %c\n", 
                i, nucleobase[i], res, ref, (res == ref) ? 'p' : 'F');
    }
    return EXIT_SUCCESS;
}

Upvotes: 6

Converting nucleobase representation from ASCII to UCSC .2bit

Answers (3)

Related Questions