
Reputation: 21658

Why is the size of `char` 4 bytes in Rust?

This code shows that char takes 4 bytes:

println!("char : {}", std::mem::size_of::<char>());

Why does it take 4 bytes?
Does the size depend on the platform, or is it always 4 bytes?
If it's always 4 bytes, is that for some special reason?
Does the compiler guarantee some minimum size for char?

On https://play.rust-lang.org/ I also get 4 bytes.

Upvotes: 24

Answers (4)

plugwash

Reputation: 10494

Why does it take 4 bytes?

Because two bytes are not enough to hold every Unicode scalar value (the largest is U+10FFFF), and three would be slow and awkward to process.

Does the size depend on the platform, or is it always 4 bytes?

The documentation states

"char is guaranteed to have the same size, alignment, and function call ABI as u32 on all platforms."

There exist platforms with C compilers that use bytes larger than 8 bits. However, to the best of my knowledge, Rust does not currently target any such platform.
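The documented guarantee can be checked directly. This is only a minimal sketch; the helper names `char_layout` and `u32_layout` are made up for illustration and are not part of any library.

```rust
use std::mem::{align_of, size_of};

/// Returns (size, alignment) of `char` in bytes. Hypothetical helper.
fn char_layout() -> (usize, usize) {
    (size_of::<char>(), align_of::<char>())
}

/// Returns (size, alignment) of `u32` in bytes. Hypothetical helper.
fn u32_layout() -> (usize, usize) {
    (size_of::<u32>(), align_of::<u32>())
}

fn main() {
    println!("char: {:?}", char_layout());
    println!("u32:  {:?}", u32_layout());
    // The two layouts should match on every platform Rust targets.
    assert_eq!(char_layout(), u32_layout());
}
```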

Upvotes: 1

Matthieu M.

Reputation: 299760

First of all: a char in Rust is a unique integral value representing a Unicode scalar value. For example, consider 💩 (aka Pile of Poo, aka U+1F4A9): in Rust it is represented by a char with a value of 128169 in decimal (that is, 0x1F4A9 in hexadecimal):

fn main() {
    let c: char = "💩".chars().next().unwrap();
    println!("💩 is {} ({})", c, c as u32);
}

On the playpen.

With that said, the Rust char is 4 bytes because 4 is the smallest power-of-2 number of bytes that can hold the integral value of any Unicode scalar value. The decision was driven by the domain, not by architectural constraints.
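The arithmetic behind "smallest power of 2" can be worked out from `char::MAX`. The sketch below is illustrative only; `bits_needed` and `pow2_bytes` are hypothetical helpers, not standard functions.

```rust
/// Number of bits needed to represent `value` (hypothetical helper).
fn bits_needed(value: u32) -> u32 {
    u32::BITS - value.leading_zeros()
}

/// Smallest power-of-two byte count that can hold `bits` bits (hypothetical helper).
fn pow2_bytes(bits: u32) -> u32 {
    let bytes = (bits + 7) / 8; // round up to whole bytes
    bytes.next_power_of_two()
}

fn main() {
    let max = char::MAX as u32; // the largest Unicode scalar value, U+10FFFF
    let bits = bits_needed(max);
    println!(
        "char::MAX = U+{:X}, needs {} bits, stored in {} bytes",
        max,
        bits,
        pow2_bytes(bits)
    );
}
```

U+10FFFF needs 21 bits, so three bytes would technically suffice; rounding up to a power of two gives four.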

Note: the emphasis on scalar value matters because many "characters" as we perceive them are actually graphemes composed of multiple combining characters in Unicode; in that case, multiple char values are required.
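As a small illustration of that note, the same visible "é" can be one char (precomposed U+00E9) or two (the letter e followed by the combining acute accent U+0301). This is only a sketch; `char_count` is a hypothetical helper name.

```rust
/// Counts the Unicode scalar values (chars) in a string. Hypothetical helper.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let precomposed = "\u{E9}"; // é as a single scalar value
    let combined = "e\u{301}"; // e + combining acute accent

    println!("precomposed: {} char(s)", char_count(precomposed));
    println!("combined:    {} char(s)", char_count(combined));
}
```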

Upvotes: 32

cruster946

Reputation: 477

char is four bytes; it doesn't depend on the architecture.

Why? According to Wikipedia's article on UTF-8:

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use. Four bytes are needed for characters in the other planes of Unicode.

So if you want to represent any possible Unicode character, the compiler must reserve 4 bytes.
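The byte counts quoted above can be observed with `char::len_utf8`. The sketch below uses an arbitrary sample character from each range; `utf8_len` is a hypothetical wrapper. Note that a char value itself always occupies 4 bytes, while its UTF-8 encoding inside a `str` varies from 1 to 4 bytes.

```rust
use std::mem::size_of;

/// Bytes needed to encode `c` in UTF-8. Hypothetical wrapper around `char::len_utf8`.
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    // One sample from each range: ASCII, two-byte, rest of the BMP, other planes.
    let samples = ['a', '\u{E9}', '\u{20AC}', '\u{1F4A9}'];
    for c in samples {
        println!(
            "U+{:04X}: {} byte(s) in UTF-8, {} byte(s) as a char",
            c as u32,
            utf8_len(c),
            size_of::<char>()
        );
    }
}
```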

You should also consider Byte Alignment: http://www.eventhelix.com/realtimemantra/ByteAlignmentAndOrdering.htm

Upvotes: 4

DK.

Reputation: 58975

char is four bytes. It is always four bytes, it will always be four bytes. Four bytes it be, and four bytes shall it remain.

It's not for anything special; four bytes is simply the smallest power of two in which you can store any Unicode scalar value. Various other languages do the same thing.

Upvotes: 10
