FatalCatharsis

Reputation: 3567

Accessing chars in utf-8 strings

First, I want to make sure I understand the concept of UTF-8 correctly. When a string is stored in UTF-8, each character actually occupies 1 to 4 bytes, depending on the character it represents.

If I had an ascii string like this:

std::string meh = "blah";

then to obtain the fourth char, all the compiler has to do is take a pointer to the first char and add an offset of 3 (times sizeof(char)) to locate it, like this:

meh[3] == *(meh.data() + 3);    // not real code, just pseudo C for what the compiler does

However, if I have a string like this:

std::string peh = "blah★!";

and I wanted the exclamation point, peh[5] would not retrieve "!" but the second byte of the ★ character: ★ is U+2605, which UTF-8 encodes as 3 bytes, so "blah" occupies bytes 0-3, ★ occupies bytes 4-6, and "!" doesn't appear until byte 7.
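
To make that concrete, here's a quick byte dump (just a sketch; I spell the star with explicit \xE2\x98\x85 escapes so the result doesn't depend on the compiler's source charset):

#include <cstddef>
#include <cstdio>
#include <string>

int main() {
    // ★ is U+2605, which UTF-8 encodes as the 3 bytes E2 98 85.
    std::string peh = "blah\xE2\x98\x85!";
    for (std::size_t i = 0; i < peh.size(); ++i) {
        unsigned byte = static_cast<unsigned char>(peh[i]);
        std::printf("peh[%zu] = 0x%02X\n", i, byte);
    }
    // prints: bytes 0-3 are "blah", 4-6 are the star, 7 is '!'
}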

So would the only way to randomly access that character be a linear scan? (Start from the beginning and, for each character, check its length and skip ahead that many bytes until I reach the correct character index.) If so, why does everyone want to store files in UTF-8? Wouldn't that just make parsing and analysis hugely more expensive?

(For context: I'm writing a language lexer, and information all around says source files should be in UTF-8. But if I support variable-length characters, wouldn't that just complicate everything unnecessarily? Would it be acceptable to support just the single-byte ASCII subset of UTF-8 in source files?)

Upvotes: 0

Views: 1653

Answers (3)

Remy Lebeau

Reputation: 597176

So would the only way to randomly access that character be a linear scan? (Start from the beginning and, for each character, check its length and skip ahead that many bytes until I reach the correct character index.)

Yes, exactly.
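
A minimal sketch of that scan (the helper name is mine, nothing standard): UTF-8 continuation bytes always have the form 10xxxxxx, so you count only the bytes that don't match that pattern:

#include <cstddef>
#include <string>

// Returns the byte offset of the n-th codepoint (0-based), or
// std::string::npos if the string has fewer than n+1 codepoints.
std::size_t offset_of_codepoint(const std::string& s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // A byte whose top two bits are 10 is a continuation byte;
        // anything else starts a new codepoint.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (count == n) return i;
            ++count;
        }
    }
    return std::string::npos;
}

For the question's "blah★!", offset_of_codepoint(peh, 5) returns 7, the byte offset of the "!".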

If so, why does everyone want to store files in UTF-8?

UTF-8 is more portable than UTF-16 or UTF-32 (UTF-8 has no endian issues), and is backwards compatible with ASCII, so it won't break the large majority of legacy apps. Also, UTF-8 is more compact in byte size than UTF-16 for Unicode codepoints U+0000 - U+007F, and is the same byte size as UTF-16 for codepoints U+0080 - U+07FF. So UTF-8 tends to be a better choice for handling the majority of the world's commonly used English/Latin-based languages. However, once you start dealing with Unicode codepoints above U+07FF (Asian languages, symbols, emojis, etc.), UTF-16 usually becomes more compact than UTF-8.

UTF-16 tends to be easier to work with when processing data, since it only deals with 1 codeunit for codepoints U+0000 - U+FFFF, compared to UTF-8's use of 1-3 codeunits for the same codepoints. UTF-16 uses 2 codeunits for the remaining codepoints, compared to UTF-8's use of 4 codeunits for the same codepoints.

But even then, UTF-16 is technically a variable-length encoding, so you can't really use random access with it, either. True random access is possible in UTF-8 only if the data contains codepoints U+0000 - U+007F and nothing higher, and is possible in UTF-16 only if the data contains codepoints U+0000 - U+FFFF and nothing higher. Anything else requires linear scanning. However, scanning through UTF-16 is easier than scanning through UTF-8 since fewer codeunits are involved. And UTF-16 is designed to easily detect leading and trailing codeunits to skip them during scanning, whereas UTF-8 does not lend itself as well to that.
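
For example, detecting and skipping surrogates looks like this (a sketch, with my own helper names):

#include <cstddef>

// In UTF-16, lead (high) surrogates are 0xD800-0xDBFF and trail (low)
// surrogates are 0xDC00-0xDFFF; any other codeunit is a complete codepoint.
bool is_lead_surrogate(char16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
bool is_trail_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Advance one codepoint: 2 codeunits for a surrogate pair, otherwise 1.
std::size_t next_codepoint(const char16_t* s, std::size_t len, std::size_t i) {
    if (is_lead_surrogate(s[i]) && i + 1 < len && is_trail_surrogate(s[i + 1]))
        return i + 2;
    return i + 1;
}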

Wouldn't that just make parsing and analysis hugely more expensive?

UTF-8 is better suited for storage and communications, but not necessarily easier for parsing. It depends on the languages involved. UTF-16 tends to be better suited for parsing, as long as you account for surrogate pairs.

If you don't want to handle variable-length characters, and need true random access, then use UTF-32 instead, since it uses only 1 codeunit for every possible codepoint.
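
For example (a minimal sketch):

#include <string>

int main() {
    std::u32string s = U"blah\u2605!";  // "blah★!": 6 codepoints, 6 codeunits
    char32_t bang = s[5];               // retrieves '!' directly, no scanning
    return bang == U'!' ? 0 : 1;        // sanity check: index 5 really is '!'
}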

For context: I'm writing a language lexer, and information all around says source files should be in UTF-8. But if I support variable-length characters, wouldn't that just complicate everything unnecessarily?

Not necessarily, especially if you are only supporting forward parsing. Even with UTF-16, you have to account for variable-length characters as well.

Would it be acceptable to support just the single-byte ASCII subset of UTF-8 in source files?

That depends on the requirements of your parser, but I would say no. Many users want to be able to embed Unicode data in their source files, and even use Unicode identifiers where possible. Even back in the ANSI days before Unicode, non-ASCII characters could be either single-byte or multi-byte depending on the charset used.

So unless you want to completely shun non-ASCII languages (which is not a good idea in today's international world), you should deal with variable-length characters in one form or another.

Upvotes: 4

Kedar Mhaswade

Reputation: 4695

So would the only way to randomly access that character be a linear scan? (Start from the beginning and, for each character, check its length and skip ahead that many bytes until I reach the correct character index.)

With Unicode, instead of a character, you look for a code point: every abstract character is given a unique number in Unicode, and UTF-8 is one of the many ways to encode those code points as bytes. This means that if you are reading or writing UTF-8 encoded text, you (or the library you use) need to know how the encoding works. Random access into a string is byte-addressed, so unless you know the exact byte offset of a character in the given encoding, random access to it is not going to work.
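
Knowing how the encoding works means, for example, being able to read a sequence's length straight off its lead byte (a sketch that assumes well-formed UTF-8):

// Number of bytes in a UTF-8 sequence, determined from its lead byte.
int utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return -1;                            // continuation byte or invalid lead
}

A byte of the form 10xxxxxx is never a lead byte, which is how a scanner can resynchronize in the middle of a stream.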

If so, why does everyone want to store files in UTF-8?

Well, UTF-8 is an encoding scheme that can represent the characters specified in the Unicode standard. If you ever need to read and write characters beyond, say, the ASCII character set, you have to choose some encoding scheme to represent them, and people naturally settle on the encoding that best suits their needs. It is true that this brings some storage requirements into play, but are you really more worried about the length of the file than about accurately representing the content of the message?

Wouldn't that just make parsing and analysis hugely more expensive?

No, not if there is no other way to represent the characters you expect. If you know that all your text will fit in the ASCII character set, then there is no need for UTF-8's multi-byte sequences. (That said, UTF-8 is backward compatible with ASCII, so ASCII text is already valid UTF-8.)

Upvotes: 0

tripleee

Reputation: 189679

You are comparing apples and oranges. Unicode is hugely more expressive than ASCII; out of the popular encodings which support Unicode, UTF-8 is the simplest and most compact for the vast majority of cases, and the compatibility with ASCII for pure 7-bit text is a huge bonus.

If your code is completely dominated by character length calculations, and you need to support Unicode, consider using UTF-32 internally. (UTF-16 is variable length, too, because of surrogate pairs.)
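
A rough sketch of that approach (no validation; real code must reject malformed sequences): decode once up front, and the lexer gets O(1) indexing afterwards:

#include <cstddef>
#include <string>

// Decode well-formed UTF-8 into UTF-32 so codepoints can be indexed directly.
std::u32string to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = in[i];
        char32_t cp;
        int len;
        if      (b < 0x80)           { cp = b;        len = 1; }  // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }  // 1110xxxx
        else                         { cp = b & 0x07; len = 4; }  // 11110xxx
        // Fold in the 6 payload bits of each continuation byte.
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}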

Upvotes: 0
