Reputation: 104029
I'm trying to extract all matches from an EBML definition, which is something like this:
| + A track
| + Track number: 3
| + Track UID: 724222477
| + Track type: subtitles
...
| + Language: eng
...
| + A track
| + Track number: 4
| + Track UID: 745646561
| + Track type: subtitles
...
| + Language: jpn
...
I want all occurrences of "Language: ???" when preceded by "Track type: subtitles". I tried several variations of this:
Track type: subtitles.*Language: (\w\w\w)
I'm using the multi-line modifier in Ruby so that . matches newlines (like the 's' modifier in other languages).
This only gets the last occurrence, which in the example above is 'jpn':
string.scan(/Track type: subtitles.*Language: (\w\w\w)/m)
=> [["jpn"]]
The result I'd like:
=> [["eng"], ["jpn"]]
What would be a correct regex to accomplish this?
Upvotes: 3
Views: 563
Reputation: 89221
You need to use a lazy quantifier (.*?) instead of the greedy .* here. Try this:
/Track type: subtitles.*?Language: (\w\w\w)/m
This should get you the first occurrence of "Language: ???" after each "Track type: subtitles". But it would get confused if some track of type subtitles were missing the Language field: the match would run on into a later track and report that track's Language instead.
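A minimal Ruby sketch of that caveat. The sample text and the constant names are made up for illustration: track 4 is a subtitles track with no Language line, followed by an audio track.

```ruby
# Lazy pattern from this answer; /m lets `.` match newlines in Ruby.
LAZY = /Track type: subtitles.*?Language: (\w\w\w)/m

# Hypothetical input: track 4 is a subtitles track without a Language field.
INFO = <<~TEXT
  | + A track
  | + Track number: 3
  | + Track type: subtitles
  | + Language: eng
  | + A track
  | + Track number: 4
  | + Track type: subtitles
  | + A track
  | + Track number: 5
  | + Track type: audio
  | + Language: jpn
TEXT

# The second match starts in track 4 and runs on into track 5,
# so the audio track's language is reported as if it were a subtitle.
p INFO.scan(LAZY) # => [["eng"], ["jpn"]]
```

Here "jpn" belongs to the audio track, which is exactly the confusion described above.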
Another way to do this would be:
/^\| \+ (?:(?!^\| \+).)*?\+ Track type: subtitles$(?:(?!^\| \+).)*?^\| \+ Language: (\w+)$/m
Looks somewhat messy, but should take care of the problem with the previous one.
A cleaner way would be to tokenize the string:
/^\| \+ ([^\r\n]+)|^\| \+ Track type: (subtitles)|^\| \+ Language: (\w+)/m
(Take note of the number of spaces)
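A hedged Ruby sketch of the tokenizing idea. It is a variant, not the exact regex above: it keys on the literal "A track" header instead of on indentation, and the names TOKEN, subtitle_languages and SAMPLE are illustrative assumptions.

```ruby
# One alternative per token; exactly one capture group is non-nil per match.
# ` +` tolerates any number of spaces between `|` and `+`.
TOKEN = /^\| +\+ (A track)$|^\| +\+ Track type: (subtitles)$|^\| +\+ Language: (\w+)$/

# Walks the tokens in order and collects languages of subtitle tracks only.
def subtitle_languages(text)
  languages = []
  in_subtitles = false
  text.scan(TOKEN) do |track, type, language|
    if track
      in_subtitles = false          # a new track starts: forget the old type
    elsif type
      in_subtitles = true           # the current track is a subtitles track
    elsif language && in_subtitles
      languages << language         # Language field inside a subtitles track
    end
  end
  languages
end

# Hypothetical input: track 4 has no Language field, track 5 is audio.
SAMPLE = <<~TEXT
  | + A track
  | + Track number: 3
  | + Track type: subtitles
  | + Language: eng
  | + A track
  | + Track number: 4
  | + Track type: subtitles
  | + A track
  | + Track number: 5
  | + Track type: audio
  | + Language: jpn
TEXT

p subtitle_languages(SAMPLE) # => ["eng"]
```

Because the flag is cleared at every track header, a subtitles track without a Language field cannot pick up the language of a later track.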
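For each match, check which of the capture groups is defined. Only one group will have a value for any single match.
If group 1 is defined, a new track has started: clear the flag subtitles.
If group 2 is defined, set the flag subtitles.
If group 3 is defined and the flag subtitles is set, report it.
Upvotes: 3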
Reputation: 176743
You need to make your regex non-greedy by changing this:
.*
To this:
.*?
Your regex is matching from the first occurrence of Track type: subtitles to the last occurrence of Language: (\w\w\w). Making it non-greedy will work because .*? matches as few characters as possible, so each match stops at the nearest following Language field.
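A small Ruby comparison of the two quantifiers, as a sketch. The sample is modeled on the question's input with the elided lines left out, and the constant names are illustrative.

```ruby
# Sample modeled on the question's input (elided lines omitted).
TRACKS = <<~TEXT
  | + A track
  | + Track number: 3
  | + Track UID: 724222477
  | + Track type: subtitles
  | + Language: eng
  | + A track
  | + Track number: 4
  | + Track UID: 745646561
  | + Track type: subtitles
  | + Language: jpn
TEXT

# /m lets `.` match newlines in Ruby.
GREEDY     = /Track type: subtitles.*Language: (\w\w\w)/m
NON_GREEDY = /Track type: subtitles.*?Language: (\w\w\w)/m

# Greedy: one match spanning from the first track type to the last language.
p TRACKS.scan(GREEDY)     # => [["jpn"]]
# Non-greedy: one match per track, each ending at the nearest language.
p TRACKS.scan(NON_GREEDY) # => [["eng"], ["jpn"]]
```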
Upvotes: 7