Samuel Norbury

Reputation: 23

Regex that starts and ends with specific string, and does not have (other) string in the middle

I'm trying to match strings in an HTML document that start and end with specific strings and do not contain another specific string in the middle. More specifically, they start with

$start = "<br/>\s*[0-9]{1,4}(\.|\:|\))+";

end with

$end = "\?";

and should contain anything BUT line breaks (<br/>) in the middle.

Currently my middle regex looks like this:

$middle = "[^(<br/>)]+";

Final code will look like this:

$start = "<br/>\s*[0-9]{1,4}(\.|\:|\))+";
$middle = //What do I put here?
$end = "\?";
$regex = "#".$start.$middle.$end."#";
preg_match_all($regex, $text, $hits);

How should I write my middle regex so that it only matches text that doesn't contain a line break (<br/>)?

Upvotes: 2

Views: 672

Answers (2)

show-me-the-code

Reputation: 1553

If you are looking to match any html text between <br /> and ?:

  • that doesn't contain any other <br />, then this expression works:

    <br\s?\/>\s*([0-9]{1,4})[.:)]((?:(?!<br\s?\/>).)*)\?

Take a look at this demo.

  • that may contain <br /> but you are only interested in the text with the <br /> removed, then you should probably match everything between the <br /> and the ? like so:

    <br\s?\/>\s*([0-9]{1,4})[.:)]([^?]*)\?

and remove the <br /> with string replace or something. Take a look at this demo.

In each case, the first group matches your bullet point number and the second group matches the question that follows it, assuming that is what you are interested in. The expressions above accept both <br/> and <br />; to also accept <br> or <br >, make the slash optional with <br\s?\/?>.
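As a rough sketch of the second approach in PHP: the sample input and the variable names below are made up for illustration, and preg_replace is just one possible way to do the "string replace or something" step.

```php
<?php
// Hypothetical input: a numbered question that contains an inner <br />.
$text = "<br />12) First part<br />second part?";

// Second pattern from above: anything except "?" is allowed in the middle.
$regex = '#<br\s?\/>\s*([0-9]{1,4})[.:)]([^?]*)\?#';

preg_match_all($regex, $text, $hits, PREG_SET_ORDER);

$results = [];
foreach ($hits as $hit) {
    // Group 1 is the bullet number, group 2 is the raw question text.
    $number = $hit[1];
    // Replace any inner <br/> or <br /> with a space, then trim the edges.
    $question = trim(preg_replace('#<br\s?\/>#', ' ', $hit[2]));
    $results[$number] = $question;
}

print_r($results);
```

With PREG_SET_ORDER each element of $hits holds one full match plus its groups, which keeps the number and the question text together.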

Upvotes: 0

Sam

Reputation: 20486

If you use an expression like this, you should get the result you expect (although, there are better ways to parse HTML):

(?:(?!<br/>).)*

This is essentially .* on steroids. (?:...) is a "non-capturing" group used to group everything together for the * repetition. (?!...) is a negative lookahead, meaning it makes sure that <br/> isn't found ahead of the current location. So, this expression makes sure there isn't a <br/> then matches the next character and then repeats!
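Plugged into the code from the question, this could look like the sketch below. The sample $text is invented for illustration. Note that . does not match a newline by default, so if a question can span several lines of source you would probably need the s modifier (for example "#...#s").

```php
<?php
// Hypothetical sample input, made up for illustration.
$text = "<br/>1. What is a regex?"
      . "<br/>Some note"
      . "<br/>2) Is this matched?"
      . "<br/>3. Broken<br/>across lines?";

$start  = "<br/>\s*[0-9]{1,4}(\.|\:|\))+";
$middle = "(?:(?!<br/>).)*";   // consume characters, but never step onto a "<br/>"
$end    = "\?";
$regex  = "#" . $start . $middle . $end . "#";

preg_match_all($regex, $text, $hits);

// $hits[0] holds the full matches. Item 3 is skipped because a <br/>
// appears before its closing "?".
print_r($hits[0]);
```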


In your expression, [^(<br/>)]+, you're misunderstanding how character classes work. It says: match any character 1+ times, as long as it is not one of the following characters: (, <, b, r, /, >, ). Maybe this demo will explain it better.
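A small comparison may make the difference concrete. The sample string is hypothetical, picked only because it contains the letters b and r but no <br/> tag.

```php
<?php
// Hypothetical sample text with no <br/> in it at all.
$text = "number of bars";

// Character class: excludes the single characters ( < b r / > )
preg_match('#[^(<br/>)]+#', $text, $classHit);

// Negative lookahead: only refuses to cross the literal sequence "<br/>"
preg_match('#(?:(?!<br/>).)*#', $text, $lookaheadHit);

echo $classHit[0], "\n";      // stops at the first "b"
echo $lookaheadHit[0], "\n";  // runs to the end of the string
```

The character class gives up at the "b" in "number" even though no tag is present, while the lookahead version consumes the entire string.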

Upvotes: 2
