user10060973

String Index Error (Julia)

I'm a Julia newbie. When I was testing out the language, I got this error.

First, I define a String b as "he§y".

Julia seems to behave strangely when a String contains "special" characters.

When I try to get the third character of b (which should be '§'), everything is OK.

However, when I try to get the fourth character of b (which should be 'y'), a StringIndexError is thrown.
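
For reference, the steps described above can be reproduced with a sketch like the following. It assumes Julia 1.x. The try/catch is only there so that the snippet runs to the end instead of aborting, and the names third and err are just illustrative.

```julia
b = "he§y"

# Indexing at 3 works and returns the special character
third = b[3]

# Indexing at 4 throws; capture the exception instead of aborting
err = try
    b[4]
catch e
    e
end

println(third)
println(typeof(err))
```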

Upvotes: 0

Views: 1543

Answers (2)

HinduWarrior

Reputation: 61

The '§' character takes up 2 bytes in UTF-8, so 4 is not a valid character index into the string and the next valid index is 5.

The characters of "he§y" are laid out in memory as follows:

s[1]: h
s[2]: e
s[3]s[4]: § (bytes 3 and 4 together form a single character addressed by index 3, so there is no valid index 4)
s[5]: y
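
The layout above can be checked directly. This is a small sketch assuming Julia 1.x; the variable names are only illustrative. codeunits exposes the raw UTF-8 bytes, and eachindex lists the indices at which a character actually starts.

```julia
s = "he§y"

# Raw UTF-8 bytes: '§' (U+00A7) is encoded as the two bytes 0xc2 0xa7
bytes = collect(codeunits(s))

# Valid character indices; the second byte of '§' (index 4) is skipped
valid = collect(eachindex(s))

n_bytes = ncodeunits(s)   # size of the string in bytes
n_chars = length(s)       # number of characters

println(bytes)
println(valid)
println((n_bytes, n_chars))
println(isvalid(s, 4))    # whether a character starts at byte index 4
```

The difference between ncodeunits(s) and length(s) is exactly the extra byte used by '§'.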

Upvotes: 0

Adrian Shum

Reputation: 40076

I don't believe the compiler would throw this error. Do you mean a runtime error?

I know nothing about the Julia language, but the symptoms suggest that string indexing is based not on code points but on the bytes of some encoding.

The Julia documentation seems to support my hypothesis:

https://docs.julialang.org/en/stable/manual/strings/

The built-in concrete type used for strings (and string literals) in Julia is String. This supports the full range of Unicode characters via the UTF-8 encoding. (A transcode function is provided to convert to/from other Unicode encodings.)

...

Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.


Edit: The following example, quoted from the Julia documentation, demonstrates the exact "problem" you are facing.

julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"

Whether these Unicode characters are displayed as escapes or shown as special characters depends on your terminal's locale settings and its support for Unicode. String literals are encoded using the UTF-8 encoding. UTF-8 is a variable-width encoding, meaning that not all characters are encoded in the same number of bytes. In UTF-8, ASCII characters – i.e. those with code points less than 0x80 (128) – are encoded as they are in ASCII, using a single byte, while code points 0x80 and above are encoded using multiple bytes – up to four per character. This means that not every byte index into a UTF-8 string is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown:

julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)

julia> s[2]
ERROR: StringIndexError("∀ x ∃ y", 2)
[...]

julia> s[3]
ERROR: StringIndexError("∀ x ∃ y", 3)
Stacktrace:
[...]

julia> s[4]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
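
Applied to the string from the question, the usual way around this appears to be stepping between valid indices rather than adding 1 to a byte index. The following is only a sketch, assuming Julia 1.x and the standard nextind, collect, and eachindex functions; the variable names are made up for illustration.

```julia
b = "he§y"

# Step from one valid index to the next instead of adding 1
i = nextind(b, 3)
after_special = b[i]

# Index of the 4th character: start before the string (index 0)
# and advance by 4 characters
j = nextind(b, 0, 4)
fourth = b[j]

# Or convert to a Vector{Char}, which is indexed by character position
# (this allocates a new array)
chars = collect(b)
fourth_again = chars[4]

# Iterating over eachindex only visits valid indices
for idx in eachindex(b)
    println(idx, " => ", b[idx])
end
```

Converting with collect is the simplest option for short strings, while nextind avoids the extra allocation when only a few positions are needed.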

Upvotes: 6
