Reputation: 454

Julia: How to read in and output characters with diacritics?

Processing characters outside the ASCII range (1-127) can easily crash Julia.

mystring = "A-Za-zÀ-ÿŽž"
for i in 1:length(mystring)
    print(i,":::")
    print(Int(mystring[i]),"::" )
    println(  mystring[i]       )
end

gives me

1:::65::A
2:::45::-
3:::90::Z
4:::97::a
5:::45::-
6:::122::z
7:::192::À
8:::ERROR: LoadError: StringIndexError("A-Za-zÀ-ÿŽž", 8)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:12
 [2] getindex_continued(::String, ::Int64, ::UInt32) at .\strings\string.jl:220
 [3] getindex(::String, ::Int64) at .\strings\string.jl:213
 [4] top-level scope at R:\_LV\STZ\Web_admin\Languages\Action\Returning\chars.jl:5
 [5] include(::String) at .\client.jl:457
 [6] top-level scope at REPL[18]:1

It crashes after outputting the first character outside the ASCII range, rather than during that output, as mentioned in the answer to "String Index Error (Julia)".
If declaring the values in Julia one should declare them as Unicode, but I have these characters in my input.
The manual says that Julia looks at the locale, but is there an "everywhere" locale?

Is there some way to handle input and output of these characters in Julia?

I am working on Windows 10, but I can switch to Linux if that works better for this.

Upvotes: 6

Answers (3)

GKi

Reputation: 39657

String indices in Julia refer to code units (= bytes for UTF-8), the fixed-width building blocks that are used to encode arbitrary characters (code points). This means that not every index into a String is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown.
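
As a small illustration of the difference between characters and code units, the sketch below uses the question's string and only the Base functions `length`, `ncodeunits` and `isvalid`. It assumes the source file itself is saved as UTF-8.

```julia
s = "A-Za-zÀ-ÿŽž"

# length counts characters; ncodeunits counts code units (bytes for UTF-8).
println(length(s))        # 11
println(ncodeunits(s))    # 15

# isvalid(s, i) tells whether byte index i is the start of a character.
println(isvalid(s, 7))    # true: À starts at byte 7
println(isvalid(s, 8))    # false: byte 8 is the second byte of À
```

This is why `1:length(mystring)` fails in the question: it produces the numbers 1 to 11, and 8 is not the start of a character.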

You can use enumerate to get each character together with its iteration count.

mystring = "A-Za-zÀ-ÿŽž"

for (i, x) in enumerate(mystring)
    print(i,":::")
    print(Int(x),"::")
    println(x)
end
#1:::65::A
#2:::45::-
#3:::90::Z
#4:::97::a
#5:::45::-
#6:::122::z
#7:::192::À
#8:::45::-
#9:::255::ÿ
#10:::381::Ž
#11:::382::ž

If you need each character together with its byte index in the string, you can use pairs.

for (i, x) in pairs(mystring)
    print(i,":::")
    print(Int(x),"::")
    println(x)
end
#1:::65::A
#2:::45::-
#3:::90::Z
#4:::97::a
#5:::45::-
#6:::122::z
#7:::192::À
#9:::45::-
#10:::255::ÿ
#12:::381::Ž
#14:::382::ž

Upvotes: 2

zsalya

Reputation: 454

In preparation for expanding my MCVE into what I actually want to do, which involves advancing the string position in ways other than a plain for loop, I used the information in the post written by Bogumił Kamiński to come up with this:

mystring = "A-Za-zÀ-ÿŽž"
for i in 1:length(mystring)
    print(i,":::")
    mychar = mystring[nextind(mystring, 0, i)]
    print(Int(mychar), "::")
    println(  mychar )
end
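
A possible alternative when the position has to be advanced by hand is to keep a byte index and step it with the two-argument `nextind`, which avoids rescanning the string from the start on every iteration. This is only a sketch: `char_positions` is a hypothetical helper name, not something from Base or from the linked post.

```julia
# Hypothetical helper: walk a string by hand and collect
# (byte index, character) pairs.
function char_positions(s::AbstractString)
    positions = Tuple{Int,Char}[]
    i = firstindex(s)
    while i <= lastindex(s)      # lastindex is the byte index of the last character
        push!(positions, (i, s[i]))
        i = nextind(s, i)        # jump to the start of the next character
    end
    return positions
end

for (i, c) in char_positions("A-Za-zÀ-ÿŽž")
    println(i, ":::", Int(c), "::", c)
end
```

Inside the loop, `i` can be advanced more than once per pass (for example to skip a character) as long as every step goes through `nextind`.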

Upvotes: 1

Bogumił Kamiński

Reputation: 69869

Use eachindex to get a list of valid indices in your string:

julia> mystring = "A-Za-zÀ-ÿŽž"
"A-Za-zÀ-ÿŽž"

julia> for i in eachindex(mystring)
           print(i, ":::")
           print(Int(mystring[i]), "::")
           println(mystring[i])
       end
1:::65::A
2:::45::-
3:::90::Z
4:::97::a
5:::45::-
6:::122::z
7:::192::À
9:::45::-
10:::255::ÿ
12:::381::Ž
14:::382::ž

Your issue is related to the fact that Julia uses byte-indexing of strings, as is explained in the Julia Manual.

For example, the character À takes two bytes, so since it starts at index 7, the next valid index is 9, not 8.

In the UTF-8 encoding, which Julia uses by default, only ASCII characters take one byte; all other characters take 2, 3 or 4 bytes (see https://en.wikipedia.org/wiki/UTF-8#Encoding).

For example, for À you get two bytes:

julia> codeunits("À")
2-element Base.CodeUnits{UInt8, String}:
 0xc3
 0x80
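
If indexing by character number is more convenient than byte indexing, one option (a sketch, not the only approach) is to copy the string into a `Vector{Char}` with `collect`. The variable names below are illustrative.

```julia
s = "A-Za-zÀ-ÿŽž"

# collect produces a Vector{Char} with one element per character,
# so ordinary 1-based positions work.
chars = collect(s)
println(chars[8])          # -

# Number of UTF-8 bytes each character occupies in the original string.
for c in chars
    println(c, " => ", ncodeunits(string(c)), " byte(s)")
end
```

The trade-off is memory: the vector is a copy, and each `Char` occupies four bytes regardless of how many bytes the character takes in UTF-8.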

I have also written a post at https://bkamins.github.io/julialang/2020/08/13/strings.html that tries to explain how byte-indexing vs character-indexing works in Julia.

If you have additional questions please comment.

Upvotes: 6
