Reputation: 21658
This code shows that char takes 4 bytes:
println!("char : {}", std::mem::size_of::<char>());
Why does char take 4 bytes? Does the size depend on the platform, or is it always 4 bytes? In https://play.rust-lang.org/ I also get 4 bytes.
Upvotes: 24
Views: 10887
Reputation: 10494
Why does it take 4 bytes?
Because two bytes is not enough, and three would be slow and awkward to process.
Does the size depend on the platform, or is it always 4 bytes?
The documentation states
"char is guaranteed to have the same size, alignment, and function call ABI as u32 on all platforms."
There exist platforms with C compilers that use bytes larger than 8 bits. However, to the best of my knowledge, Rust does not currently target any such platform.
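The documented guarantee can be checked directly. This is a minimal sketch; the helper names `char_layout` and `u32_layout` are made up for illustration and are not part of the standard library.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: (size, alignment) of `char`, in bytes.
fn char_layout() -> (usize, usize) {
    (size_of::<char>(), align_of::<char>())
}

/// Hypothetical helper: (size, alignment) of `u32`, in bytes.
fn u32_layout() -> (usize, usize) {
    (size_of::<u32>(), align_of::<u32>())
}

fn main() {
    // The documentation says char and u32 share size and alignment on all platforms.
    assert_eq!(char_layout(), u32_layout());
    // The size itself is 4 bytes.
    assert_eq!(size_of::<char>(), 4);
    println!("char: {:?}, u32: {:?}", char_layout(), u32_layout());
}
```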
Upvotes: 1
Reputation: 299760
First of all: a char in Rust is a unique integral value representing a Unicode Scalar value. For example, consider 💩 (aka Pile of Poo, aka U+1F4A9). In Rust it will be represented by a char with a value of 128169 in decimal (that is 0x1F4A9 in hexadecimal):
fn main() {
    let c: char = "💩".chars().next().unwrap();
    println!("💩 is {} ({})", c, c as u32);
}
With that said, the Rust char is 4 bytes because 4 bytes is the smallest power-of-two number of bytes that can hold any Unicode Scalar value (the largest is U+10FFFF, which needs 21 bits). The decision was driven by the domain, not by architectural constraints.
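The arithmetic behind "smallest power of 2" can be sketched as follows. The helpers `bits_needed` and `smallest_pow2_bytes` are hypothetical names introduced only for this illustration; `char::MAX` and `char::from_u32` are from the standard library.

```rust
/// Hypothetical helper: number of bits needed to represent `value`.
fn bits_needed(value: u32) -> u32 {
    32 - value.leading_zeros()
}

/// Hypothetical helper: smallest power-of-two byte count holding `bits` bits.
fn smallest_pow2_bytes(bits: u32) -> u32 {
    let bytes = (bits + 7) / 8; // round up to whole bytes
    bytes.next_power_of_two()
}

fn main() {
    let max = char::MAX as u32; // the largest Unicode scalar value, U+10FFFF
    let bits = bits_needed(max);
    println!("char::MAX = U+{:X}, needs {} bits", max, bits);
    println!("smallest power-of-two size: {} bytes", smallest_pow2_bytes(bits));

    // Not every u32 is a valid char: surrogates and values above
    // U+10FFFF are not Unicode scalar values.
    assert!(char::from_u32(0xD800).is_none());
    assert!(char::from_u32(0x110000).is_none());
}
```

Three bytes would be enough to hold 21 bits, but three is not a power of two, so rounding up gives four.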
Note: the emphasis on Scalar value is there because a number of "characters" as we see them are actually graphemes composed of multiple Unicode scalar values (for example, a base letter followed by combining characters); in this case multiple char values are required.
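As a small illustration of the note above, the same visible "é" can be one scalar value or two. The helper `scalar_count` is a hypothetical name for this sketch; it counts char values only, since the standard library does not segment graphemes.

```rust
/// Hypothetical helper: counts the Unicode scalar values (chars) in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let precomposed = "\u{E9}"; // é as a single scalar value
    let combining = "e\u{301}"; // e followed by a combining acute accent

    // Both render as one "character", but the char counts differ.
    println!("{} -> {} char(s)", precomposed, scalar_count(precomposed));
    println!("{} -> {} char(s)", combining, scalar_count(combining));
}
```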
Upvotes: 32
Reputation: 477
A char is four bytes; it doesn't depend on the architecture.
Why? According to Wikipedia's UTF-8 article:
The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use. Four bytes are needed for characters in the other planes of Unicode.
So if you want to be able to represent any possible Unicode character, the compiler must reserve 4 bytes.
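The byte counts quoted from the article can be observed with `char::len_utf8`. This is only a sketch, and `utf8_len` is a hypothetical wrapper added for illustration. Note that these are the lengths of the UTF-8 encoding as used inside a str or String; a char value in memory stays at 4 bytes regardless.

```rust
use std::mem::size_of;

/// Hypothetical wrapper: bytes `c` occupies when encoded as UTF-8.
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    // One sample each from ASCII, two-byte, three-byte, and four-byte ranges.
    for c in "a\u{E9}\u{20AC}\u{1F4A9}".chars() {
        println!(
            "{} -> {} UTF-8 byte(s), {} bytes as a char",
            c,
            utf8_len(c),
            size_of::<char>()
        );
    }
}
```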
You should also consider Byte Alignment: http://www.eventhelix.com/realtimemantra/ByteAlignmentAndOrdering.htm
Upvotes: 4
Reputation: 58975
char is four bytes. It is always four bytes, it will always be four bytes. Four bytes it be, and four bytes shall it remain.
It's not for anything special; four bytes is simply the smallest power of two in which you can store any Unicode scalar value. Various other languages do the same thing.
Upvotes: 10