Иван Бишевац

Reputation: 14641

Regex to match 1 or 2 occurrences

I have text with the following structure:

book_name:SoftwareEngineering;author:John;author:Smith; book_name:DesignPatterns;author:Foo;author:Bar;

The element separator is ;

Up to two author elements can follow each book_name element.

There can be 2 to 10 books.

Each book has at least one author and at most two.

I would like to extract the book_name and the individual authors for every book.

I tried a regex with Regex.scan (which collects all matches):

iex> regex = ~r/book_name:(.+?;)(author:.+?;){1,2}/
iex> text = "book_name:SoftwareEngineering;author:John;author:Smith;book_name:DesignPatterns;author:Foo;author:Bar;"

iex> Regex.scan(regex, text, capture: :all_but_first)
[["SoftwareEngineering;", "author:Smith;"], ["DesignPatterns;", "author:Bar;"]]

But it doesn't collect the authors correctly: it captures only the second author of each book. Can anybody help with this problem?

Upvotes: 3

Views: 5958

Answers (3)

The fourth bird

Reputation: 163362

The part (author:.+?;){1,2} of the pattern matches author: plus everything up to the next semicolon, one or two times. However, when a capturing group is repeated like that, only the capture from its last iteration is kept. This page might be helpful.

Instead of using the non-greedy quantifier .+?, you could use a negated character class, [^;]+, which matches one or more characters other than a semicolon.

You might also make use of a capturing group and a backreference for author. The name of the book is in capturing group 1, the name of the first author in group 3 and the optional second author in group 4.

book_name:([^;]+);(author):([^;]+);(?:\2:([^;]+);)?

That will match

  • book_name: Match literally
  • ([^;]+); Group 1 matching not ; then match ;
  • (author): Group 2 author
  • ([^;]+); Group 3 matching not ; then match ;
  • (?: Non capturing group
    • \2: backreference to what is captured in group 2
    • ([^;]+); Group 4 matching not ; then match ;
  • )? Close non capturing group and make it optional

regex101 demo
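
As a rough sketch of how the pattern above might be used from Elixir: the module name BookRegex and the function extract/1 are made up for this illustration. The helper drops group 2 (the literal author label) and rejects empty strings, so it does not depend on how an unmatched optional group 4 is reported.

```elixir
defmodule BookRegex do
  # Hypothetical helper, not part of the original answer.
  def extract(text) do
    # Group 1: title, group 2: the word "author",
    # group 3: first author, group 4: optional second author.
    regex = ~r/book_name:([^;]+);(author):([^;]+);(?:\2:([^;]+);)?/

    regex
    |> Regex.scan(text, capture: :all_but_first)
    |> Enum.map(fn [title, _label | authors] ->
      # Drop an empty capture in case the optional group did not match.
      {title, Enum.reject(authors, &(&1 == ""))}
    end)
  end
end
```

With the text from the question, BookRegex.extract(text) should give [{"SoftwareEngineering", ["John", "Smith"]}, {"DesignPatterns", ["Foo", "Bar"]}].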

Upvotes: 3

Adam Millerchip

Reputation: 23091

You don't need regex for that, you can use String.split/3:

defmodule Book do
  def extract(text) do
    text
    |> String.split("book_name:", trim: true)
    |> Enum.map(&String.split(&1, [":", ";"], trim: true))
    |> Enum.map(fn [title, _, author1, _, author2] -> {title, author1, author2} end)
  end
end

Output:

iex> Book.extract(text)
[{"SoftwareEngineering", "John", "Smith"}, {"DesignPatterns", "Foo", "Bar"}]

For simplicity I assumed there are always two authors. The last Enum.map call can be replaced with this one, which also handles the case where there is no second author:

|> Enum.map(fn
  [title, _, author1] -> {title, author1, nil}
  [title, _, author1, _, author2] -> {title, author1, author2}
end)
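
Put together, the two snippets above might look like the sketch below. The module name BookSplit is chosen here only to keep it apart from the Book module, and the code assumes there is no whitespace between one book and the next, as in the string passed to iex in the question.

```elixir
defmodule BookSplit do
  # Illustrative combination of the snippets above; assumes no
  # whitespace between the trailing ";" of one book and the next book_name.
  def extract(text) do
    text
    # One chunk per book, e.g. "DesignPatterns;author:Foo;author:Bar;"
    |> String.split("book_name:", trim: true)
    # Each chunk becomes [title, "author", name] or
    # [title, "author", name, "author", name]
    |> Enum.map(&String.split(&1, [":", ";"], trim: true))
    |> Enum.map(fn
      [title, _, author1] -> {title, author1, nil}
      [title, _, author1, _, author2] -> {title, author1, author2}
    end)
  end
end
```

For example, BookSplit.extract("book_name:SoftwareEngineering;author:John;book_name:DesignPatterns;author:Foo;author:Bar;") should return [{"SoftwareEngineering", "John", nil}, {"DesignPatterns", "Foo", "Bar"}].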

Upvotes: 1

Snow

Reputation: 4097

In many engines, including Elixir's, you can't repeat a capture group like that and get a result for each repetition: you only get the last result of any given repeated capture group. Instead, write out each possible group individually, and then filter out empty matches:

book_name:(.+?;)author:(.+?);(?:author:(.+?);)?

https://regex101.com/r/LPgzcG/1
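
A minimal sketch of the "filter out empty matches" step in Elixir, using the pattern above unchanged. The module name BookScan and function extract/1 are hypothetical. Because group 1 in this pattern includes the trailing semicolon, the sketch strips it from the title.

```elixir
defmodule BookScan do
  # Hypothetical helper built around the pattern from this answer.
  def extract(text) do
    regex = ~r/book_name:(.+?;)author:(.+?);(?:author:(.+?);)?/

    regex
    |> Regex.scan(text, capture: :all_but_first)
    |> Enum.map(fn [title | authors] ->
      # Group 1 captures the ";" too, so remove it from the title,
      # and drop any empty capture left by the optional second author.
      {String.trim_trailing(title, ";"), Enum.reject(authors, &(&1 == ""))}
    end)
  end
end
```

With the text from the question this should produce [{"SoftwareEngineering", ["John", "Smith"]}, {"DesignPatterns", ["Foo", "Bar"]}].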

Upvotes: 2
