user11533661

How to solve a greedy regular expression

I have a problem with a regular expression in PHP.

This text should be handled:

Start Text1
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
End Text1
Start Text2
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
End Text2

I would like to wrap each group of <li> lines in <ul> and </ul>.

I tried this pattern:

(?!<\/li>)\s*(<li>.*</li>)\s*(?=<li>|)

But it gives something like this:

Start Text1
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
End Text1
Start Text2
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
End Text2

The "End Text1" and "Start Text2" lines are also included inside the <ul>. I would prefer to get this result:

Start Text1
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
End Text1
Start Text2
<ul>
<li>Item1</li>
<li>Item2</li>
<li>Item3</li>
</ul>
End Text2

How can I do this?

I tested this here: https://www.phpliveregex.com/p/sHs#tab-preg-replace

Upvotes: 2

Views: 72

Answers (1)

joanis

Reputation: 12229

Fixing the regex

This regular expression works (with the s modifier, so that . also matches newlines):

(\s*<li>.*?<\/li>\s*)(?!\s*<li>)

Explanation:

  • .*? is lazy: it matches as little as possible between <li> and </li>, so the match ends at the last </li> of the current list instead of running on to the last </li> in the whole input;
  • I escaped the / in the second instance of </li>, as you had already done in the first instance;
  • (?!\s*<li>) says the next bit of text cannot be another <li>. It is needed because otherwise .*? above would match each <li> line separately;
  • the initial (?!<\/li>) doesn't actually do anything, because the pattern right after it only accepts whitespace or <li>, which can never be </li>, so I removed it.
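
To make the greedy versus lazy difference concrete, here is a minimal sketch that counts the matches each pattern produces on the sample input from the question. It assumes the s modifier was active in the original test (the output in the question suggests it was), and it uses ~ as the delimiter for the question's pattern so that the unescaped </li> can be kept as written. The variable names are only illustrative.

```php
<?php
// Sample input from the question (variable names are illustrative).
$input = "Start Text1\n<li>Item1</li>\n<li>Item2</li>\n<li>Item3</li>\nEnd Text1\n"
       . "Start Text2\n<li>Item1</li>\n<li>Item2</li>\n<li>Item3</li>\nEnd Text2\n";

// Pattern from the question: the greedy .* runs to the LAST </li> in the
// input, so both lists and the text between them land in a single match.
$greedy = '~(?!<\/li>)\s*(<li>.*</li>)\s*(?=<li>|)~s';

// Lazy .*? plus the negative lookahead: one match per run of <li> lines.
$lazy = '/(\s*<li>.*?<\/li>\s*)(?!\s*<li>)/s';

$greedyCount = preg_match_all($greedy, $input, $greedyMatches);
$lazyCount   = preg_match_all($lazy, $input, $lazyMatches);

echo "greedy matches: $greedyCount\n"; // prints "greedy matches: 1"
echo "lazy matches: $lazyCount\n";     // prints "lazy matches: 2"
```

One match for the greedy pattern means a single <ul>...</ul> pair around everything, which is the unwanted output shown in the question.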

Nicer handling of newlines

On the Live Regex web site, I was not able to insert newlines where I wanted to.

In PHP proper, you can use

$output = preg_replace('/\s*(<li>.*?<\/li>)\s*(?!\s*<li>)/smi',
   "\n<ul>\n$1\n</ul>\n", $input);

or

$output = preg_replace('/(\s*<li>.*?<\/li>\s*)(?!\s*<li>)/smi',
   "\n<ul>$1</ul>\n", $input);

to get nicer results. The key is to put the replacement pattern in double quotes, so that PHP turns \n into a real newline. In single quotes it would stay a literal backslash followed by n.
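
The quoting point can be checked with a small sketch like the one below. The shortened input and the variable names are assumptions made for illustration; the pattern is the first one above.

```php
<?php
// Illustrative input: one short list between two text lines.
$input   = "Start Text1\n<li>Item1</li>\n<li>Item2</li>\nEnd Text1\n";
$pattern = '/\s*(<li>.*?<\/li>)\s*(?!\s*<li>)/smi';

// Double quotes: PHP converts \n to a newline before preg_replace sees it.
// $1 is left alone because PHP variable names cannot start with a digit.
$double = preg_replace($pattern, "\n<ul>\n$1\n</ul>\n", $input);

// Single quotes: \n stays as a backslash followed by the letter n.
$single = preg_replace($pattern, '\n<ul>\n$1\n</ul>\n', $input);

echo $double;        // <ul> and </ul> on their own lines
echo $single, "\n";  // literal \n sequences appear in the output
```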

Handling indented input better

If the input is indented, you might also consider something like this:

$output = preg_replace('/(\s*)(<li>.*?<\/li>)(\s*)(?!\s*<li>)/smi',
   "$1<ul>$1$2$1</ul>$3", $input);

This will put <ul> and </ul> at the same indentation level as the first <li>, and keep the surrounding text at the indentation it had beforehand.
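
As a quick check of the indentation behaviour, here is a minimal sketch on a hypothetical two-item list indented by two spaces. The input and variable names are made up for illustration, and the pattern is written with its opening / delimiter.

```php
<?php
// Illustrative indented input (two spaces before each <li>).
$input = "Start Text1\n  <li>Item1</li>\n  <li>Item2</li>\nEnd Text1\n";

// $1 = whitespace before the first <li> (newline plus indentation),
// $2 = the run of <li> lines, $3 = whitespace after the last </li>.
$output = preg_replace(
    '/(\s*)(<li>.*?<\/li>)(\s*)(?!\s*<li>)/smi',
    "$1<ul>$1$2$1</ul>$3",
    $input
);

echo $output;
// Start Text1
//   <ul>
//   <li>Item1</li>
//   <li>Item2</li>
//   </ul>
// End Text1
```

Reusing $1 before <ul>, before the first <li>, and before </ul> is what copies the original indentation onto the new tags.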

None of this matters much, though, since these spacing variants don't change how the resulting HTML is interpreted.

Upvotes: 1
