Reputation: 14641
I have text with the following structure:
book_name:SoftwareEngineering;author:John;author:Smith; book_name:DesignPatterns;author:Foo;author:Bar;
The element separator is ;
One or two author elements follow each book_name element
There can be 2 to 10 books
Each book has at least one author and at most two authors
I would like to extract book_name and individual authors for every book.
I tried a regex with Regex.scan (which collects all matches):
iex> regex = ~r/book_name:(.+?;)(author:.+?;){1,2}/
iex> text = "book_name:SoftwareEngineering;author:John;author:Smith;book_name:DesignPatterns;author:Foo;author:Bar;"
iex> Regex.scan(regex, text, capture: :all_but_first)
[["SoftwareEngineering;", "author:Smith;"], ["DesignPatterns;", "author:Bar;"]]
But it doesn't collect the authors correctly: it captures only the second author of each book. Can anybody help with this problem?
Upvotes: 3
Views: 5958
Reputation: 163362
This part of the pattern, (author:.+?;){1,2}, matches author: and what follows up to the semicolon, 1 to 2 times. But repeating a capturing group like that only gives you the value of its last iteration. This page might be helpful.
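To see this concretely, here is a minimal sketch in Elixir that uses the pattern from the question on a shortened version of its sample text. Only the last iteration of the repeated group survives:

```elixir
text = "book_name:SoftwareEngineering;author:John;author:Smith;"
regex = ~r/book_name:(.+?;)(author:.+?;){1,2}/

# The second group matches twice ("author:John;" then "author:Smith;"),
# but a capture group keeps only the value of its final iteration.
result = Regex.run(regex, text, capture: :all_but_first)
IO.inspect(result)
# prints ["SoftwareEngineering;", "author:Smith;"]
```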
Instead of using a non-greedy quantifier .+? you could use a negated character class [^;]+ that matches one or more characters other than a semicolon.
You might also make use of a capturing group and a backreference for author. The name of the book is in capturing group 1, the name of the first author in group 3 and the optional second author in group 4.
book_name:([^;]+);(author):([^;]+);(?:\2:([^;]+);)?
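As a rough sketch of how this pattern could be used from Elixir with Regex.scan, assuming the sample text from the question. The variable names and the {title, authors} result shape are illustrative choices, not part of the original answer. Group 2 only holds the literal word author, so it is discarded:

```elixir
text =
  "book_name:SoftwareEngineering;author:John;author:Smith;" <>
    "book_name:DesignPatterns;author:Foo;author:Bar;"

regex = ~r/book_name:([^;]+);(author):([^;]+);(?:\2:([^;]+);)?/

books =
  regex
  |> Regex.scan(text, capture: :all_but_first)
  |> Enum.map(fn [title, _key | authors] ->
    # Drop an empty entry in case the optional second author did not match.
    {title, Enum.reject(authors, &(&1 == ""))}
  end)

IO.inspect(books)
# prints [{"SoftwareEngineering", ["John", "Smith"]}, {"DesignPatterns", ["Foo", "Bar"]}]
```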
The pattern matches:
book_name:    match literally
([^;]+);      group 1: match one or more characters other than ;, then match ;
(author):     group 2: match author, then match :
([^;]+);      group 3: match one or more characters other than ;, then match ;
(?:           open a non-capturing group
\2:           backreference to what is captured in group 2, then match :
([^;]+);      group 4: match one or more characters other than ;, then match ;
)?            close the non-capturing group and make it optional
Upvotes: 3
Reputation: 23091
You don't need regex for that; you can use String.split/3:
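defmodule Book do
  def extract(text) do
    text
    # One chunk per book, e.g. "SoftwareEngineering;author:John;author:Smith;"
    |> String.split("book_name:", trim: true)
    # Each chunk becomes [title, "author", name, "author", name]
    |> Enum.map(&String.split(&1, [":", ";"], trim: true))
    |> Enum.map(fn [title, _, author1, _, author2] -> {title, author1, author2} end)
  end
end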
Output:
iex> Book.extract(text)
[{"SoftwareEngineering", "John", "Smith"}, {"DesignPatterns", "Foo", "Bar"}]
For simplicity I assumed there are always two authors. The last Enum.map call can be replaced with this one, which also handles the case where there is no second author:
|> Enum.map(fn
  [title, _, author1] -> {title, author1, nil}
  [title, _, author1, _, author2] -> {title, author1, author2}
end)
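A possible variation on the same splitting idea, sketched here under the assumption that a list of authors is more convenient than a fixed-size tuple. The module name BookList is hypothetical. It also trims each chunk, so stray whitespace between books (as in the first sample line of the question) does not break the pattern match:

```elixir
defmodule BookList do
  # Hypothetical variant: returns {title, [author, ...]} for each book.
  def extract(text) do
    text
    |> String.split("book_name:", trim: true)
    |> Enum.map(fn chunk ->
      # After trimming, a chunk looks like "Title;author:A;author:B;"
      [title | fields] =
        chunk
        |> String.trim()
        |> String.split([":", ";"], trim: true)

      # fields alternates between the key "author" and an author name.
      authors =
        fields
        |> Enum.chunk_every(2)
        |> Enum.map(fn ["author", name] -> name end)

      {title, authors}
    end)
  end
end

IO.inspect(BookList.extract("book_name:SoftwareEngineering;author:John;author:Smith; book_name:DesignPatterns;author:Foo;"))
# prints [{"SoftwareEngineering", ["John", "Smith"]}, {"DesignPatterns", ["Foo"]}]
```

Input that does not follow the key:value; layout raises a match error here rather than being skipped silently.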
Upvotes: 1
Reputation: 4097
In many engines, including Elixir's, you can't repeat multiple capture groups like that and get the result for each repeated group - you'll only get the last result of any given repeated capture group. Rather, write out each possible group individually, and then filter out empty matches:
book_name:(.+?;)author:(.+?);(?:author:(.+?);)?
https://regex101.com/r/LPgzcG/1
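A minimal sketch of applying this pattern in Elixir and filtering out the empty match for a missing second author. The second book in the sample text (Solo, by Ann) is made-up data to show the one-author case. Note that, as written, group 1 includes the trailing semicolon, so the sketch strips it:

```elixir
text = "book_name:SoftwareEngineering;author:John;author:Smith;book_name:Solo;author:Ann;"
regex = ~r/book_name:(.+?;)author:(.+?);(?:author:(.+?);)?/

books =
  regex
  |> Regex.scan(text, capture: :all_but_first)
  |> Enum.map(fn [title | authors] ->
    # Group 1 captures "Title;", so remove the semicolon; then drop any
    # empty string left by the optional second-author group.
    {String.trim_trailing(title, ";"), Enum.reject(authors, &(&1 == ""))}
  end)

IO.inspect(books)
# prints [{"SoftwareEngineering", ["John", "Smith"]}, {"Solo", ["Ann"]}]
```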
Upvotes: 2