Reputation: 507
I have created a file with UTF-8 encoding, but I don't understand the rules for the size it takes up on disk. Here is my complete research:
First I created the file with a single Hindi letter, 'क', and the file size on Windows 7 was 8 bytes.
Next, with two letters, 'कक', the file size was 11 bytes.
Then, with three letters, 'ककक', the file size was 14 bytes.
Can someone please explain why it shows these sizes?
Upvotes: 11
Views: 2749
Reputation: 7206
On Linux-based systems, you can use hexdump to get a hexadecimal dump (as used by Tim in his answer) and see how many bytes each character occupies.
echo -n a | hexdump -C
echo -n क | hexdump -C
Here's the output of the above two commands.
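If hexdump is not at hand, the same check can be sketched in Python. This is only an illustration: the helper name utf8_hex is made up for this example, and the characters are identified by code point (U+0061 is 'a', U+0915 is 'क').

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated uppercase hex bytes.

    Hypothetical helper for illustration; not part of any library.
    """
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


for ch in ("a", "क"):
    size = len(ch.encode("utf-8"))
    # Print the code point rather than the glyph so the output is plain ASCII.
    print(f"U+{ord(ch):04X}: {utf8_hex(ch)} ({size} byte(s))")

# Prints:
# U+0061: 61 (1 byte(s))
# U+0915: E0 A4 95 (3 byte(s))
```

So an ASCII letter takes one byte in UTF-8, while क takes three.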
Upvotes: 0
Reputation: 336158
The first three bytes, EF BB BF, are the BOM (Byte Order Mark).
Then the bytes E0 A4 95 encode the letter क.
Then the bytes 0D 0A encode a carriage return and a line feed (the Windows line ending).
Total: 8 bytes. For each letter क you add, you need three more bytes.
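The arithmetic can be checked with a short Python sketch. It assumes the file layout described here (a UTF-8 BOM, the letters, then one CR LF pair); the function name saved_size is made up for this example.

```python
import codecs


def saved_size(text: str) -> int:
    """Size in bytes of text stored as: UTF-8 BOM + UTF-8 text + CR LF.

    Hypothetical helper; the layout is an assumption taken from the
    byte breakdown in this answer.
    """
    data = codecs.BOM_UTF8 + text.encode("utf-8") + b"\r\n"
    return len(data)


for count in (1, 2, 3):
    print(count, saved_size("क" * count))

# Prints:
# 1 8
# 2 11
# 3 14
```

In general the size is 3 (BOM) + 3 per letter क + 2 (line ending), which matches the 8, 11, and 14 bytes from the question.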
Upvotes: 10