SoftTimur

Reputation: 5510

Substring function of Unicode strings by scalars

Previously, I wrote a substring function for Unicode strings that iterates over grapheme clusters, shown below. However, the positions passed to the function count Unicode scalar values: for instance, \r\n counts as 2 scalars, whereas grapheme-cluster segmentation counts \r\n as 1. So this function did not work correctly in some cases:

let uni_sub (s: string) (pos: int) (len: int) = 
  let (_, r) = 
    Uuseg_string.fold_utf_8 
      `Grapheme_cluster
      (fun (p, acc) ch -> if (p >= pos) && (p <= pos+len-1) then (p+1, acc ^ ch) else (p+1, acc))
      (0, "")
      s 
    in 
  r

I was advised to write a substring function that works over scalar values instead, by using Uutf.String.fold_utf_8 and Buffer.add_utf_8_uchar. However, since I don't understand well how these libraries work, I could only roughly write the following code, and I wanted to get the types to check first.

let uni_sub_scalars (s: string) (pos: int) (len: int) = 
  let b: Buffer.t = Buffer.create 42 in
  let rec add (acc: string list) (v: [ `Uchar of Stdlib.Uchar.t | `Await | `End ]) : Uuseg.ret =
    match v with
    | `Uchar u -> 
      Buffer.add_utf_8_uchar b u; 
      add acc `Await
    | `Await | `End -> failwith "don't know what to do"
  in
  let (_, r) = 
    Uuseg_string.fold_utf_8 
      (`Custom (Uuseg.custom ~add:add))
      (fun (p, acc) ch -> if (p >= pos) && (p <= pos+len-1) then (p+1, acc ^ ch) else (p+1, acc))
      (0, "")
      s 
    in 
  r

And the compilation returned an error that I don't know how to fix:

File "lib/utility.ml", line 45, characters 6-39:
45 |       (`Custom (Uuseg.custom ~add:add))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: This expression has type
         [> `Custom of
              ?mandatory:(string list -> bool) ->
              name:string ->
              create:(unit -> string list) ->
              copy:(string list -> string list) -> unit -> Uuseg.custom ]
       but an expression was expected of type [< Uuseg.boundary ]
       Types for tag `Custom are incompatible
make: *** [lib/utility.cmo] Error 2

Could anyone help me write this substring function of Unicode strings by scalars?

Upvotes: 0

Views: 81

Answers (1)

ivg

Reputation: 35210

> And the compilation returned an error that I don't know how to fix:

The Uuseg.custom function creates a custom segmenter and takes several parameters, of which you passed only one:

val custom :
  ?mandatory:('a -> bool) ->
  name:string ->
  create:(unit -> 'a) ->
  copy:('a -> 'a) ->
  add: ('a -> [ `Uchar of Uchar.t | `Await | `End ] -> ret) -> unit -> custom

So you also need to pass the name, create, and copy parameters, as well as the final positional () argument. But I don't think this is the function you should use.

> Could anyone help me write this substring function of Unicode strings by scalars?

Yes. If we follow the advice and implement it "by using Uutf.String.fold_utf_8 and Buffer.add_utf_8_uchar", it is very easy. (Notice that the advice was to use Uutf.String.fold_utf_8, not Uuseg_string.fold_utf_8.)

A simple implementation (one that doesn't do much error checking) looks like this:

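let substring s pos len =
  let buf = Buffer.create len in
  (* The accumulator [off] is the index of the current scalar value;
     the folder's second argument (the byte offset) is ignored. *)
  let _ : int = Uutf.String.fold_utf_8 (fun off _ elt ->
      match elt with
      | `Uchar x when off >= pos && off < pos + len ->
        Buffer.add_utf_8_uchar buf x;
        off + 1
      (* Anything outside the range, or malformed, is counted but not copied. *)
      | _ -> off + 1) 0 s in
  Buffer.contents buf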

Here is how it works (using my name as a working example):

# substring "Иван\n\rГотовчиц" 0 5;;
- : string = "Иван\n"
# substring "Иван\n\rГотовчиц" 11 3;;
- : string = "чиц"

It also works with right-to-left scripts, although, as the outputs show, slicing by scalar can separate a combining mark from its base letter:

# let shalom = substring "שָׁלוֹ";;
val shalom : int -> int -> string = <fun>
# shalom 0 1;;
- : string = "ש"
# shalom 0 2;;
- : string = "שָ"
# shalom 2 2;;
- : string = "ׁל"
# shalom 2 1;;
- : string = "ׁ"
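
For comparison, here is a minimal sketch of the same idea without the Uutf dependency, assuming OCaml 4.14 or later (for String.get_utf_8_uchar and Uchar.utf_decode_length). The name substring_scalars is made up for this example. It rejects negative arguments and stops decoding as soon as the requested range is complete. Invalid byte sequences are counted as one position each and not copied, which may not match exactly how Uutf delimits `Malformed input.

```ocaml
(* Sketch only: requires OCaml >= 4.14. [substring_scalars] is a
   hypothetical name, not part of Uutf or the standard library. *)
let substring_scalars (s : string) (pos : int) (len : int) : string =
  if pos < 0 || len < 0 then invalid_arg "substring_scalars";
  let n = String.length s in
  let buf = Buffer.create n in
  (* [i] is the current byte offset, [k] the index of the current scalar. *)
  let rec go i k =
    (* Stop at the end of the string or once [len] positions were visited. *)
    if i < n && k - pos < len then begin
      let d = String.get_utf_8_uchar s i in
      let size = Uchar.utf_decode_length d in
      (* Copy the raw bytes of valid scalars that fall inside the range. *)
      if k >= pos && Uchar.utf_decode_is_valid d then
        Buffer.add_substring buf s i size;
      go (i + size) (k + 1)
    end
  in
  go 0 0;
  Buffer.contents buf
```

With this sketch, substring_scalars "a\r\nb" 1 2 returns "\r\n", since \r and \n are counted as two separate scalars.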

Upvotes: 1
