Reputation: 5510
Previously, I wrote a substring function for Unicode strings that works over grapheme clusters, as follows. The positions passed to the function, however, are counted in Unicode scalar values: for instance, \r\n counts as 2 scalar values but as only 1 grapheme cluster. So this function did not work well in some cases:
let uni_sub (s: string) (pos: int) (len: int) =
  let (_, r) =
    Uuseg_string.fold_utf_8
      `Grapheme_cluster
      (fun (p, acc) ch -> if (p >= pos) && (p <= pos+len-1) then (p+1, acc ^ ch) else (p+1, acc))
      (0, "")
      s
  in
  r
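To make the mismatch concrete, here is a minimal sketch that counts Unicode scalar values using only the standard library. It assumes OCaml 4.14 or later (for String.get_utf_8_uchar and Uchar.utf_decode_length); the name scalar_count is hypothetical and is not part of Uutf or Uuseg.

```ocaml
(* Sketch: count Unicode scalar values in a UTF-8 string, stdlib only.
   Assumes OCaml >= 4.14. Each malformed byte sequence is counted as
   one position, because the decoder always consumes at least one byte. *)
let scalar_count (s : string) : int =
  let n = String.length s in
  (* i: byte offset into s; k: number of scalar values seen so far *)
  let rec go i k =
    if i >= n then k
    else go (i + Uchar.utf_decode_length (String.get_utf_8_uchar s i)) (k + 1)
  in
  go 0 0
```

With this, scalar_count "\r\n" is 2, whereas a grapheme cluster segmenter treats the same string as a single cluster, so positions computed one way cannot be used with a fold over the other.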
I was advised to write a substring function for Unicode strings over their scalar values instead, using Uutf.String.fold_utf_8 and Buffer.add_utf_8_uchar. However, without a good understanding of how these libraries work, I could only roughly write the following code, hoping to get the types to work first.
let uni_sub_scalars (s: string) (pos: int) (len: int) =
  let b: Buffer.t = Buffer.create 42 in
  let rec add (acc: string list) (v: [ `Uchar of Stdlib.Uchar.t | `Await | `End ]) : Uuseg.ret =
    match v with
    | `Uchar u ->
      Buffer.add_utf_8_uchar b u;
      add acc `Await
    | `Await | `End -> failwith "don't know what to do"
  in
  let (_, r) =
    Uuseg_string.fold_utf_8
      (`Custom (Uuseg.custom ~add:add))
      (fun (p, acc) ch -> if (p >= pos) && (p <= pos+len-1) then (p+1, acc ^ ch) else (p+1, acc))
      (0, "")
      s
  in
  r
And the compilation returned an error that I don't know how to fix:
File "lib/utility.ml", line 45, characters 6-39:
45 |       (`Custom (Uuseg.custom ~add:add))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: This expression has type
         [> `Custom of
              ?mandatory:(string list -> bool) ->
              name:string ->
              create:(unit -> string list) ->
              copy:(string list -> string list) -> unit -> Uuseg.custom ]
       but an expression was expected of type [< Uuseg.boundary ]
       Types for tag `Custom are incompatible
make: *** [lib/utility.cmo] Error 2
Could anyone help me write this substring function of Unicode strings by scalars?
Upvotes: 0
Views: 81
Reputation: 35210
And the compilation returned an error that I don't know how to fix:
The Uuseg.custom function creates a custom segmenter and takes several parameters (you passed only one):
val custom :
  ?mandatory:('a -> bool) ->
  name:string ->
  create:(unit -> 'a) ->
  copy:('a -> 'a) ->
  add:('a -> [ `Uchar of Uchar.t | `Await | `End ] -> ret) -> unit -> custom
So you also need to pass the name, create, and copy parameters, as well as the positional () parameter. The partial application is why the error shows a function type where Uuseg.custom was expected. But I don't think that this is the function you should use.
Could anyone help me write this substring function of Unicode strings by scalars?
Yes. If we follow the advice and implement it "by using Uutf.String.fold_utf_8 and Buffer.add_utf_8_uchar", it is very easy. (Notice that we were advised to use Uutf.String.fold_utf_8, not Uuseg_string.fold_utf_8.)
A simple implementation (one that doesn't do much error checking) looks like this:
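let substring s pos len =
  let buf = Buffer.create len in
  (* off is the accumulator: the index of the current scalar value.
     The second argument (the byte position) is ignored. *)
  let _ : int = Uutf.String.fold_utf_8 (fun off _ elt ->
      match elt with
      | `Uchar x when off >= pos && off < pos + len ->
        Buffer.add_utf_8_uchar buf x;
        off + 1
      (* out-of-range characters and `Malformed input are skipped *)
      | _ -> off + 1) 0 s in
  Buffer.contents buf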
Here is how it works (using my name as a working example),
# substring "Иван\n\rГотовчиц" 0 5;;
- : string = "Иван\n"
# substring "Иван\n\rГотовчиц" 11 3;;
- : string = "чиц"
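It also works fine with right-to-left scripts. Note that positions are scalar values, so combining marks are counted (and can be cut off) separately from their base letters: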
# let shalom = substring "שָׁלוֹ";;
val shalom : int -> int -> string = <fun>
# shalom 0 1;;
- : string = "ש"
# shalom 0 2;;
- : string = "שָ"
# shalom 2 2;;
- : string = "ׁל"
# shalom 2 1;;
- : string = "ׁ"
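If depending on Uutf is not an option, the same scalar-indexed idea can be sketched with the standard library alone. This is a variant, not the code above: it assumes OCaml 4.14 or later (for String.get_utf_8_uchar and the Uchar.utf_decode_* functions), the name substring_scalars is hypothetical, and it differs in two ways: it stops decoding once the requested range has been copied, and it emits U+FFFD for malformed input in range instead of dropping it.

```ocaml
(* Sketch: scalar-indexed substring using only the standard library.
   Assumes OCaml >= 4.14. pos and len are counted in Unicode scalar
   values, not bytes and not grapheme clusters. *)
let substring_scalars (s : string) (pos : int) (len : int) : string =
  if pos < 0 || len < 0 then invalid_arg "substring_scalars";
  let n = String.length s in
  let buf = Buffer.create n in
  (* i: byte offset into s; k: index of the scalar value starting at i *)
  let rec go i k =
    if i < n && k < pos + len then begin
      let d = String.get_utf_8_uchar s i in
      (* utf_decode_uchar yields U+FFFD when the bytes at i are malformed *)
      if k >= pos then Buffer.add_utf_8_uchar buf (Uchar.utf_decode_uchar d);
      go (i + Uchar.utf_decode_length d) (k + 1)
    end
  in
  go 0 0;
  Buffer.contents buf
```

A range that runs past the end of the string is simply truncated, so substring_scalars "abc" 1 10 gives "bc" and substring_scalars "abc" 5 2 gives the empty string.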
Upvotes: 1