Rana

Reputation: 507

File size in UTF-8 encoding?

I created a file with UTF-8 encoding, but I don't understand what determines the size it takes up on disk. Here is what I found:

  1. First I created the file with a single Hindi letter 'क', and the file size on Windows 7 was 8 bytes.

  2. Then with two letters, 'कक', the file size was 11 bytes.

  3. Then with three letters, 'ककक', the file size was 14 bytes.

Can someone please explain why it shows these sizes?

Upvotes: 11

Views: 2749

Answers (2)

Bal Krishna Jha

Reputation: 7206

On Linux-based systems, you can use hexdump to get a hexadecimal dump (as used by Tim in his answer) and see how many bytes a character occupies.

echo -n a | hexdump -C
echo -n क | hexdump -C

Here's the output of the above two commands: [image: hexdump output]
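Where hexdump is not available (on Windows, for example), roughly the same check can be sketched in Python. The helper name `utf8_hex` is made up for this illustration; it only prints the UTF-8 bytes of a string so the per-character byte count is visible.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of `text` as space-separated hex pairs.

    `utf8_hex` is a hypothetical helper for this example, not a library call.
    """
    encoded = text.encode("utf-8")
    return " ".join(f"{byte:02x}" for byte in encoded)


for ch in ("a", "क"):
    # Show the character, its UTF-8 bytes in hex, and the byte count.
    print(ch, utf8_hex(ch), len(ch.encode("utf-8")))
```

The ASCII letter takes one byte, while the Devanagari letter takes three.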

Upvotes: 0

Tim Pietzcker

Reputation: 336158

The first three bytes are used for the BOM (Byte Order Mark) EF BB BF.

Then, the bytes E0 A4 95 encode the letter क.

Then the bytes 0D 0A encode a carriage return and a line feed (the Windows line ending).

Total: 8 bytes. For each letter क you add, you need three more bytes.
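The breakdown above can be checked with a small Python sketch. It assumes the editor writes a UTF-8 BOM and one trailing CR LF, as described in this answer; `expected_file_size` is a hypothetical name used only here. Python's `utf-8-sig` codec prepends the BOM when encoding.

```python
def expected_file_size(text, newline="\r\n"):
    """Size in bytes of `text` saved as UTF-8 with a BOM and a trailing newline.

    Assumption: the editor adds a 3-byte BOM (EF BB BF) and one CR LF.
    """
    # "utf-8-sig" writes the BOM first, then the UTF-8 bytes of the text.
    return len((text + newline).encode("utf-8-sig"))


for sample in ("क", "कक", "ककक"):
    # 3 bytes of BOM + 3 bytes per letter + 2 bytes of CR LF
    print(sample, expected_file_size(sample))
```

Under those assumptions this gives 8, 11 and 14 bytes, matching the sizes in the question.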

Upvotes: 10
