SJWard

Reputation: 3739

Single Byte Character Codes

I'm trying to implement a compression algorithm, and for it I need an array of single-byte character codes. I'm unfamiliar with how character codes are handled in programming, but the requirements are that each code fits in a single byte and is not OS- or machine-dependent. Is it as simple as using the integer values 0 to 255?

I used the following little snippet to see what characters are available to me if that is the case:

for (int i = 0; i < 256; i++) {
    std::cout << (char)i << std::endl;
}

Many of the first values seem to print as blank or invisible characters, and the last set are all displayed as ?

EDIT:

More specifically, I'm trying to implement an algorithm similar to the one in this paper. It chops a DNA sequence into segments of 4 and converts them using a hash table, so that, for example, AAAA maps to one single-byte character code and AAAT maps to another. For DNA, 4 bytes into 1 byte is pretty good compression. However, if I want to extend the alphabet from A, T, C, G to A, T, C, G, N, and -, I would need 6^4 = 1296 codes instead of the 4^4 = 256 needed for the 4-letter alphabet, which no longer fits in one byte. I could reduce the compression from 4-in-1 to 3-in-1 and then need only 6^3 = 216 single-byte character codes.
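As a rough sketch of the 3-in-1 variant, a triplet can be treated as a three-digit base-6 number instead of being looked up in a hash table. This is not the paper's method; the symbol order "ATCGN-" and the names pack3 and unpack3 are assumptions made up for illustration, and the code needs C++17 for std::string_view.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical symbol order; any fixed order works as long as
// the encoder and the decoder agree on it.
constexpr std::string_view kAlphabet = "ATCGN-";

// Pack three symbols into one byte as a base-6 number (0..215).
// Returns false if the input is not exactly three alphabet symbols.
bool pack3(std::string_view triplet, std::uint8_t& out) {
    if (triplet.size() != 3) return false;
    unsigned value = 0;
    for (char c : triplet) {
        const auto pos = kAlphabet.find(c);
        if (pos == std::string_view::npos) return false;
        value = value * 6 + static_cast<unsigned>(pos);
    }
    out = static_cast<std::uint8_t>(value);
    return true;
}

// Inverse of pack3: recover the three symbols from a code in 0..215.
std::string unpack3(std::uint8_t code) {
    std::string triplet(3, '?');
    unsigned value = code;
    for (int i = 2; i >= 0; --i) {
        triplet[i] = kAlphabet[value % 6];
        value /= 6;
    }
    return triplet;
}
```

The codes are plain numbers in the range 0 to 215, so nothing here depends on a character encoding or on the OS. A sequence whose length is not a multiple of 3 would need some padding rule, which this sketch leaves out.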

This compression is part of something I'm trying to write that should read in sequences from multiple sequence alignments (strings over the 6-letter alphabet A, T, C, G, N and -), which may be very large, and remove everything uninformative for my analysis program. I plan to do this in steps: compress the sequence as much as possible, find the uninformative parts in the compressed representation, expand what remains, do a second sweep to remove the remaining uninformative parts in the uncompressed representation, and finally compress the informative remainder again in preparation for the analysis program.

Perhaps there are better schemes than a hash scheme; I've heard of something called a reference-based scheme that I need to read up on. I've also wondered: once the DNA string has been compressed to the 256-value single-byte format, can it be compressed further with the methods used to compress regular text?

Upvotes: 0

Views: 1419

Answers (2)

Alex B

Reputation: 84812

Why you see this output

Some ASCII codes are non-printable. Use isprint() to check if a character is printable.
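A minimal sketch of that check, assuming the default "C" locale; the helper name describeByte is made up for illustration. It shows the character when it is printable and the numeric code otherwise:

```cpp
#include <cctype>
#include <string>

// Describe a byte value: the character itself if it is printable in the
// current C locale, otherwise its numeric code in angle brackets.
std::string describeByte(unsigned char byte) {
    // std::isprint expects a value representable as unsigned char (or EOF),
    // which is why the parameter is unsigned char rather than plain char.
    if (std::isprint(byte)) {
        return std::string(1, static_cast<char>(byte));
    }
    return "<" + std::to_string(static_cast<int>(byte)) + ">";
}
```

Calling this for every value from 0 to 255 in the question's loop would make the control characters at the start of the range visible as numbers instead of blanks.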

It's also worth checking which encoding your shell is using. Modern setups use UTF-8, so if you print byte values above 127 ("extended ASCII"), the terminal tries to interpret them as parts of multibyte UTF-8 sequences. A lone byte in that range is not a valid UTF-8 sequence, which is why it shows up as ?.

How you should work with binary data

If you work on algorithms that operate on binary data, like compression, you are better off ignoring character encodings completely. Avoid interpreting the data as strings in the terminal and treat it as a sequence of integers from 0 to 255. When debugging, pipe the data into hexdump or print the integer value of each byte.
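For the second option, a small helper along these lines could be used; the name toHex and the two-digit lowercase format are arbitrary choices for this sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Format each byte as two lowercase hex digits separated by spaces.
std::string toHex(const std::vector<std::uint8_t>& bytes) {
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i != 0) out << ' ';
        // Cast to int so the stream prints a number, not a character.
        out << std::setw(2) << static_cast<int>(bytes[i]);
    }
    return out.str();
}
```

Printing the result of toHex to std::cout keeps the terminal output pure ASCII regardless of what the bytes contain.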

Upvotes: 1

christopher

Reputation: 1081

There are multiple character sets. If you want guaranteed single-byte characters, you need the ASCII character set, which only defines codes 0 to 127. You can use a specific code page if you want to support a non-English language, but you'll have to decide which one.

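Also note that you can do arithmetic directly on char values. In C++, whether plain char is signed is implementation-defined, so use unsigned char (8 bits on virtually every platform) when you need the values 0 to 255.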
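To illustrate why the signedness matters, here is a small sketch that uses each byte as an array index; the function name byteHistogram is made up for the example and the code needs C++17 for std::string_view:

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Count how often each byte value (0..255) occurs in the data.
std::array<std::size_t, 256> byteHistogram(std::string_view data) {
    std::array<std::size_t, 256> counts{};  // zero-initialized
    for (char c : data) {
        // Convert through unsigned char first: where plain char is signed,
        // bytes >= 128 are negative and would index outside the array.
        ++counts[static_cast<unsigned char>(c)];
    }
    return counts;
}
```

Without the cast, the same code can work on one platform and fail on another, which is exactly the machine dependence the question wants to avoid.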

Here's the list of characters and their interpretation: http://en.wikipedia.org/wiki/ASCII

Character sets are definitely OS-dependent. I would advise you to work with UTF-8, keeping in mind that only the ASCII range (0 to 127) is encoded as single bytes; every other character takes two to four bytes.

P.S. If you're compressing files, why do you even care? Reading a file byte by byte (or char by char) and reproducing the same bytes/chars on the other side will definitely work.

Upvotes: 0
