bottlenecked
bottlenecked

Reputation: 2157

elixir String.split/3: how to tell which character was matched

I'm trying to build a parser using String.split/3 as an exercise just to see if I can make it faster than classic recursive pattern matching (by effectively allocating fewer strings), but I'm having trouble making it correct.

I need to be able to split binaries in 2 parts at either a { or a }, and String.split(str, ["{", "}"], parts: 2) does exactly that of course. My problem is that the result contains only the list [part1, part2] but it does not contain the character it matched on, which I need because it affects the behavior of the parser.

My first instinct was to make my own index/2 but after reading up some more it seems like a bad idea

Is there a better alternative for my use case? I'd like to go through the string as few times as possible, creating new strings only at { and } boundaries.

Thank you for reading!

Upvotes: 2

Views: 1122

Answers (2)

Dogbert
Dogbert

Reputation: 222060

Since you only want to split at the first occurrence, I'd suggest using :binary.match/3 and :binary.part/3 here. :binary.match/3 returns a tuple of index of the match and the length of the match on success, which can then be used with :binary.part/2 to split the binary.

defmodule A do
  def split(binary) do
    case :binary.match(binary, ["{", "}"]) do
      {start, length} ->
        before = :binary.part(binary, 0, start)
        match = :binary.part(binary, start, length)
        after_ = :binary.part(binary, start + length, byte_size(binary) - (start + length))
        {before, match, after_}
      :nomatch -> nil
    end
  end
end

IO.inspect A.split("foo { bar") |> IO.inspect
IO.inspect A.split("foo } bar") |> IO.inspect
IO.inspect A.split("foo + bar") |> IO.inspect

Output:

{"foo ", "{", " bar"}
{"foo ", "{", " bar"}
{"foo ", "}", " bar"}
{"foo ", "}", " bar"}
nil
nil

This implementation goes through the string only once, in the :binary.match/3 call. All other functions and operations used are O(1).

I've used a similar method in an XML parser I was working on and it provided a huge speed up compared to recursive pattern matching on the binary.

Edit: you can use pattern matching to make this code way shorter, and I'm pretty sure with the exact same performance as above:

defmodule A do
  def split(binary) do
    case :binary.match(binary, ["{", "}"]) do
      {start, length} ->
        <<a::binary-size(start), b::binary-size(length), c::binary>> = binary
        {a, b, c}
      :nomatch -> nil
    end
  end
end

Upvotes: 3

Austin Salonen
Austin Salonen

Reputation: 50215

Dig a little deeper in the String.split docs and use the include_captures option.

iex> String.split("Foo{bar}", ~r({|}), [include_captures: true, trim: true])
["Foo", "{", "bar", "}"]

Upvotes: 3

Related Questions