Converting a Numerical String to an Array of Bytes Represented in Binary

Question

I’d like to convert a string containing only integers to an array of bytes, but to be stored efficiently (so no “digits[digitIndex] = string[digitIndex - ‘0’;”). I would like them to be stored like any type is stored: having 256 different possibilities per byte, not only 10 as in the previous, faulty example. It also needs to hold a lot of digits (I’m using an 8-bit parameter as the size, so at least 100 digits I believe). Edit: I also do not want to use any libraries whatsoever for personal reasons.

Here’s an example of what it would look like in a function:

int8_t *stringToBigInt(char *input) {
    uint8_t digitsBase10 = strlen(input);
    uint8_t bytes = ???; //However many bytes to store the result (max 255 bytes in this case)
    int8_t *result = malloc(sizeof(void *) + bytes);
    ... //Code for setting result to input
    return result;
}

And here’s an example of a possible input and output:

Edit: This is a short example that fits into 32-bits only for simplicity; an input could be much more than a 32-bit (and possibly 64-bit) integer

Input: “1234567890”

Output: {01001001, 10010110, 00000010, 11010010}

Kuba hasn&#39;t forgotten Monica · Accepted Answer

This is a base conversion from base-10 to base-256, so that’s what you should look for as far as algorithms go. For a simplistic implementation, first implement long division by powers of 2 working on strings. Then convert each of the remainders to a byte: these bytes form your output. You’ll want to repeatedly divide the input, and each string of 8 remainder bit remainders forms the base-256 bytes, starting at the least significant digit (one byte is one base-256 digit). Repeated division means that you feed the quotient of the preceding division to the succeeding one, as the dividend.

There are some cool algorithms that can divide base-10 numbers by powers of two, that operate much faster and are simpler than generalized long division. As a hint, let’s take an example: 510. We divide each digit by two, and feed the remainder*5 to the next digit. Let’s drop the fractional part smaller than 0.5: 510 becomes 2*100 + 5*10 + 5. Then 1*100 + 2*10 + 2 dot 5. Then 6*10 + 1. Then 3*10 dot 5, 2*10 + 5, then 1*10 + 2 dot 5, then 6, then 3, then 2 dot 5, then 1, then 0 dot 5.

For 255 we’d get 127.5, 63.5, 15.5, 7.5, 3.5, 1.5, 0.5.

Division by higher factors of two is possible, but requires repeated long additions. E.g. 33 div 4 = 0*10 + 7rem1 + 0 rem 0.75 (ha!). Divisions by two work better since we use the fact that 10=2*5, and base-n notation can be divided by factors of the base easily, without performing long additions: all operations are limited to two adjacent digits, so it’s a linear time process with cost N in number of digits. But for base conversion to base-256 you do repeated division, so the cost is ~0.5N^2. Easy to implement but costly in computations.

There are better algorithms than that, of course. But the above can be implemented concisely - even in the form of reasonably good quality library functions:

First, let's define an array-of-bytes type, and a way to dump it to human-readable hexadecimal output. For convenience, the object is referred to via the pointer to its data, and the implementation detail doesn't figure anywhere in the interface at all. The constructor new_Bytes zero-initializes the array. There is also a method that treats the array as if it was an array of bits, ordered lest-endian (LSB first), and sets (turns on) a given bit.

// https://github.com/KubaO/stackoverflown/tree/master/questions/decimal-to-binary-54422895
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

// Bytes Class Interface

typedef uint8_t *Bytes;
typedef const uint8_t *cBytes;

Bytes new_Bytes(size_t size);
size_t Bytes_size(cBytes bytes);
void Bytes_truncate(Bytes bytes, size_t new_size);
void free_Bytes(cBytes bytes);
char *Bytes_to_hex(cBytes bytes);

static inline void Bytes_set_bit(Bytes const bytes, size_t const bit_num) {
   bytes[bit_num / 8] |= 1 << (bit_num % 8);
}

Then, the division-by-2 is performed in-place, and the flags provide additional information needed for base conversion - especially the remainder. The conversion from base 10 to base 256 uses the division and returns a new Bytes array.

// Division and Base Conversion Interface

typedef enum {
   REMAINDER = 1,           /* there is a non-zero remainder */
   ZERO = 2,                /* the quotient is zero or null */
   NULL_DECIMAL = 4,        /* the dividend is null or empty */
   NON_DECIMALS = 8,        /* division was terminated on non-decimal characters */
   LEADING_ZERO_COUNT = 16, /* count of leading zeroes in the quotient */
   LEADING_ZERO_COUNT_MASK = ~(LEADING_ZERO_COUNT - 1),
   CLR_CARRY_MASK = ~REMAINDER,
   CLR_ZERO_MASK = ~ZERO,
} DivFlags;

DivFlags divide_by_2(char *decimal);
Bytes base_10_to_256(const char *decimal);

The division operates on the decimal representation, in order from most-significant to least-significant digit. Each digit is merged with the remainder from the prior digit's division, and then is divided by 2. The remainder is carried between digit divisions. After division of the least significant digit, the remainder is output in the flags.

The flags are mostly self-explanatory, but LEADING_ZERO_COUNT isn't quite - and thus the access to it is implemented via accessor functions. LEADING_ZERO_COUNT is the unit of the count of leading zeroes. As the division steps though the decimal representation, it will count the leading zeroes, multiply them by this unit, and merge it with the flags. To extract the count, the flags are divided by the unit.

// Division and Base Conversion Implementation

static inline int leading_zero_count(DivFlags const flags) {
   return (flags & LEADING_ZERO_COUNT_MASK) / LEADING_ZERO_COUNT;
}

static inline void saturated_inc_leading_zero_count(DivFlags *flags) {
   if ((*flags & LEADING_ZERO_COUNT_MASK) != LEADING_ZERO_COUNT_MASK)
      *flags += LEADING_ZERO_COUNT;
}

DivFlags divide_by_2(char *decimal) {
   DivFlags flags = ZERO;
   if (!decimal) return flags | NULL_DECIMAL;
   char c;
   while ((c = *decimal)) {
      if (c < '0' || c > '9') return flags | NON_DECIMALS;
      c = c - '0' + ((flags & REMAINDER) ? 10 : 0);
      if (c & 1)
         flags |= REMAINDER;
      else
         flags &= CLR_CARRY_MASK;
      c >>= 1;
      assert(c >= 0 && c <= 9);
      if (c)
         flags &= CLR_ZERO_MASK;
      else if (flags & ZERO)
         saturated_inc_leading_zero_count(&flags);
      *decimal++ = c + '0';
   }
   return flags;
}

Then, the base conversion performs repeated division by 2, and shifts the remainder bits into the byte array, as follows:

First, the base conversion takes a copy of the decimal representation, and allocates the output byte array of the appropriate size.

static void base_10_to_256_impl(Bytes const bytes, char *decimal);

Bytes base_10_to_256(const char *const decimal) {
   assert(decimal);
   size_t const dec_len = strlen(decimal);
   char *const dec_buf = malloc(dec_len + 1);
   if (!dec_buf) return NULL;
   memcpy(dec_buf, decimal, dec_len + 1);

   size_t const BASE_RATIO_NUM = 416, /* ceil(log(10)/log(256)*1000) */
       BASE_RATIO_DENOM = 1000;
   assert(dec_len <= (SIZE_MAX / BASE_RATIO_NUM));
   size_t const len = (size_t)(dec_len * BASE_RATIO_NUM / BASE_RATIO_DENOM) + 1;
   Bytes const bytes = new_Bytes(len);  // little-endian
   if (bytes) base_10_to_256_impl(bytes, dec_buf);
   free(dec_buf);
   return bytes;
}

Then, in the "meat" of the implementation, the function iterates the output bits, repeatedly dividing the decimal representation by 2, and sets each bit with the value of the remainder bit.

static void base_10_to_256_impl(Bytes const bytes, char *decimal) {
   size_t const len = Bytes_size(bytes);
   for (size_t bit_num = 0;; bit_num++) {
      DivFlags const flags = divide_by_2(decimal);
      assert(!(flags & NULL_DECIMAL));
      decimal += leading_zero_count(flags);
      if (flags & ZERO && !(flags & REMAINDER)) {
         size_t const new_len = ((bit_num + 7) / 8);
         Bytes_truncate(bytes, new_len);
         break;
      }
      // here, there are still non-zero bits - in the dec[imal] and/or in the carry
      assert((bit_num / 8) < len);
      if (flags & REMAINDER) Bytes_set_bit(bytes, bit_num);
   }
}

We can now add some tests:

// Tests

void check_bytes(const char *const decimal, const char *const bytes_expected,
                 size_t const bytes_len, const char *const hex_expected) {
   cBytes const bytes = base_10_to_256(decimal);
   assert(bytes && Bytes_size(bytes) == bytes_len);
   assert(memcmp(bytes, bytes_expected, bytes_len) == 0);
   char *const hex = Bytes_to_hex(bytes);
   assert(hex && strcmp(hex, hex_expected) == 0);
   printf("%s
", hex);
   free(hex);
   free_Bytes(bytes);
}

int main() {
   check_bytes("4294967297" /* 2^32+1 */, "\1\0\0\0\1", 5, "01 00000001");
   check_bytes("4294967296" /* 2^32   */, "\0\0\0\0\1", 5, "01 00000000");
   check_bytes("4294967295" /* 2^32-1 */, "\xFF\xFF\xFF\xFF", 4, "FFFFFFFF");
   check_bytes("16777217" /* 2^24+1 */, "\1\0\0\1", 4, "01000001");
   check_bytes("16777216" /* 2^24   */, "\0\0\0\1", 4, "01000000");
   check_bytes("16777215" /* 2^24-1 */, "\xFF\xFF\xFF", 3, "FFFFFF");
   check_bytes("256", "\0\1", 2, "0100");
   check_bytes("255", "\xFF", 1, "FF");
   check_bytes("254", "\xFE", 1, "FE");
   check_bytes("253", "\xFD", 1, "FD");
   check_bytes("3", "\3", 1, "03");
   check_bytes("2", "\2", 1, "02");
   check_bytes("1", "\1", 1, "01");
   check_bytes("0", "\0", 1, "00");
}

The implementation of the Bytes class concludes the example:

// Bytes Implementation

struct BytesImpl {
   size_t size;
   uint8_t data[1];
};
static const size_t Bytes_header_size = offsetof(struct BytesImpl, data);
_Static_assert(offsetof(struct BytesImpl, data) == sizeof(size_t),
               "unexpected layout of struct BytesImpl");

Bytes new_Bytes(size_t size) {
   assert(size <= SIZE_MAX - Bytes_header_size);
   if (!size) size++;
   struct BytesImpl *const impl = calloc(Bytes_header_size + size, 1);
   if (!impl) return NULL;
   impl->size = size;
   return &impl->data[0];
}

static const struct BytesImpl *Bytes_get_const_impl_(cBytes const bytes) {
   return (const struct BytesImpl *)(const void *)((const char *)bytes -
                                                   Bytes_header_size);
}

static struct BytesImpl *Bytes_get_impl_(Bytes const bytes) {
   return (struct BytesImpl *)(void *)((char *)bytes - Bytes_header_size);
}

size_t Bytes_size(cBytes const bytes) { return Bytes_get_const_impl_(bytes)->size; }

void Bytes_truncate(Bytes const bytes, size_t new_size) {
   size_t *const size = &Bytes_get_impl_(bytes)->size;
   if (!new_size) {
     new_size++;  // we always leave one byte in the array
     bytes[0] = 0;
   }
   assert(*size);
   if (*size <= new_size) return;
   *size = new_size;
}

void free_Bytes(cBytes const bytes) {
   if (bytes) free((void *)(intptr_t)(const void *)Bytes_get_const_impl_(bytes));
}

char *Bytes_to_hex(cBytes const bytes) {
   size_t n = Bytes_size(bytes);
   size_t spaces = (n - 1) / 4;
   char *const out = malloc(n * 2 + spaces + 1);
   if (out)
      for (char *o = out; n;) {
         uint8_t const c = bytes[n - 1];
         snprintf(o, 3, "%02" PRIX8, c);
         o += 2;
         n--;
         if (n && n % 4 == 0) {
            assert(spaces);
            *o++ = ' ';
            spaces--;
         }
      }
   return out;
}

Converting a Numerical String to an Array of Bytes Represented in Binary

Answers (1)

Related Questions