user2363399
user2363399

Reputation: 187

Searching for 8-bit aligned string in a 9-bit architecture

I need to search for a 8-bit aligned bit string "00 00 01" (hex) in a character stream. On a typical architecture, I would do it as such:

char *find(char *first, char *last)
{
    char pattern[] = {0, 0, 1};
    char *p;

    for (p = first; last - p >= sizeof(pattern); ++p) {
        if (!memcmp(p, pattern, sizeof(pattern))
            return p;
    }

    return 0;
}

However, I don't know how I would implement this function (with good performance) if char was not 8-bits.

Upvotes: 2

Views: 228

Answers (3)

Flanker
Flanker

Reputation: 408

The task is quite interesting one, so I'm coming with another option. It doesn't require to convert your stream bits into chars, instead we could use following pattern.

Since your bit values shall have 8 bit alignment, possible char index / it's start bit options are:

char 0, bit 0 (its starting bit index)
char 0, bit 8
char 1, bit 7
char 2, bit 6
char 3, bit 5
char 4, bit 4
char 5, bit 3
char 6, bit 2
char 7, bit 1

for the char 8, the starting bit would be 0, so it same as 1st item (char 0, bit 0)

Now except first position, remaining 8 possibilities are easy to verify by single expression:

The pseudo code:

int   pattern = 0x000001L;
int   mask = ~pattern;
int   char_idx = 0;

while (first <= last-2)   // need to compare 3 chars
{
    int   value = *((int*)first)); // this will actually access 4 chars, if stream has no 0 terminator, it will produce exception

    // special [char 0, bit 0] case
    if ( !char_idx && (value & mask) == pattern )
    {
        // match! do something with *first
    }

    if ( ((value >> (8 - char_idx)) & mask) == pattern ) 
    {
        // match! do something with *first
    }

    if ( ++char_idx == 9 )
       char_idx = 0;

    first++;
}

NOTE: if your int is not 36 bit, you can do char-by-char comparision

Upvotes: 2

Ruud Helderman
Ruud Helderman

Reputation: 11018

The following code should work under the following conditions:

  • long is 36 bits (4 bytes of 9 bits each)
  • big-endian architecture; the most siginificant bits of a long value are stored at the lowest address
  • long * can point to any address, not necessarily a multiple of four; in other words, no word or dword alignment
  • the pattern we are searching for is 24 bits (can be adjusted, but absolute maximum for this approach is 28 bits)

The function does not return a char * since that doesn't say much about the actual bit position. Instead, it returns the number of 8-bit groups that precede the match, or -1 if no match.

long find(char *first, char *last)
{
    long pattern = 0x000001L;      // the bit string we are searching for
    long bitmask = -0x1000L;       // initial mask: 24 ones followed by 12 zeroes
    long maxcount = ((last - first) * 9 - 24) / 8;    // 24 = pattern size (bits)
    long count;                    // counts the 8-bit groups
    char *slider = first;          // follows the 9-bit bytes
    for (count = 0; count <= maxcount; count++) {
        long actual = (*(long *)slider & bitmask);
        long expect = (bitmask & -bitmask) * pattern;
        if (actual == expect) return count;

        if (bitmask & 0xFF) {    // less than 8 zeroes on the right-hand side
            slider++;
            bitmask <<= 1;       // shift 9 bits to left, then 8 bits to right
        }
        else {
            bitmask >>= 8;       // shift 8 bits to the right, only
        }
    }
    return -1;
}

I have no idea how to test this, so it's on an 'as is' basis.

The function uses a bitmask with exactly 24 ones. The bits are continuously shifted 8 positions to the right. If a '1' threatens to be shifted out, then the memory pointer slider is incremented and the bitmask is adjusted accordingly.

slider is defined as char *, and cast to long * when being dereferenced, retrieving four 9-bit bytes in one go. If I would define slider as long *, then slider++ would proceed the pointer by 4 bytes instead of one.

Here's an example to explain this obscure expression: (bitmask & -bitmask) * pattern

  • 000011111111111111111111111100000000 = bitmask
  • 111100000000000000000000000100000000 = -bitmask
  • 000000000000000000000000000100000000 = (bitmask & -bitmask)
  • 0000pppppppppppppppppppppppp00000000 = (bitmask & -bitmask) * pattern

As you can see, it aligns pattern (the 24 bits pppppppppppppppppppppppp) with the bitmask.

Please let me know how this works out for you.

Upvotes: 1

Flanker
Flanker

Reputation: 408

What the hell is that, the 9-bit architecture? :) So the 'char' type is also 9 bit? :)

The quick and dirty way would be to convert char stream into bit-to-char representation, in other words every char representing a bit. Then just search for substring of "000001", with the alignment of 8 chars (memcmp at first[0], then memcp at first[8] etc)... Surely it is possible to do it in a 'binary'/right way, but depending on how long the stream is this could be the 'good performance' way...

Upvotes: 0

Related Questions