Reputation: 752
I'm using a tool that doesn't have any specific HTML parsing capability. It does have regex replace functionality (based on the Boost library), and I'm able to use that to convert a lot of the formatting. I understand that it's imperfect, but it's "good enough".
Lists are proving a bit trickier. I know that I could use some script code to iterate through these, but given the power of regular expressions, I feel that it should be possible with regexes alone.
My input can contain something like this:
<p>Numbered bullet list:</p>
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
<p>Standard bullet list:</p>
<ul>
<li>Item A</li>
<li>Item B</li>
<li>Item C</li>
<li>Item D</li>
</ul>
And I'd like to convert this to:
Numbered bullet list:
# Item 1
# Item 2
# Item 3
Standard bullet list:
* Item A
* Item B
* Item C
* Item D
I already have a first pass that will remove the paragraph tags, so these can be ignored for the purpose of this question. If I only have one list type, I can do a simple replace of the list tags. Is it possible to do the conversion for text containing both list types using only regexes?
Thanks!
Upvotes: 0
Views: 128
Reputation: 11
Technically this is possible. The catch is that each replacement pass converts only one item per list, and you might not know in advance how many items you have.
You could do something like
<ol>([\s\w<>/]+)<li>([^<]+)<\/li>
and replace that with
<ol>$1# $2
If you execute that n times, with n at least the number of items in the list, the numbered list is done. Extra runs are harmless, because the pattern no longer matches once every item has been converted.
The same goes for the <ul>: replace <ul>([\s\w<>/]+)<li>([^<]+)<\/li>
with <ul>$1* $2
After that you have something like this:
<p>Numbered bullet list:</p>
<ol>
# Item 1
# Item 2
# Item 3
</ol>
<p>Standard bullet list:</p>
<ul>
* Item A
* Item B
* Item C
* Item D
</ul>
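The "execute that n times" step can be sketched as a loop that repeats each replacement until it stops matching, so n never has to be known up front. This is only an illustration in Python's re module, not the Boost-based tool from the question; the names replace_until_stable and convert_items are made up for this example, and Python writes the backreferences as \1 and \2 rather than $1 and $2.

```python
import re

# Patterns taken from the answer above. Python's replacement syntax uses
# \1 and \2 where the answer writes $1 and $2.
OL_ITEM = re.compile(r"<ol>([\s\w<>/]+)<li>([^<]+)</li>")
UL_ITEM = re.compile(r"<ul>([\s\w<>/]+)<li>([^<]+)</li>")


def replace_until_stable(pattern, replacement, text):
    """Apply one substitution repeatedly until it no longer matches.

    Hypothetical helper: each pass converts one <li> per list, so the
    loop ends after as many passes as the longest list has items.
    """
    while True:
        text, count = pattern.subn(replacement, text)
        if count == 0:
            return text


def convert_items(text):
    """Turn <li> items into '# ' lines inside <ol> and '* ' lines inside <ul>."""
    text = replace_until_stable(OL_ITEM, r"<ol>\1# \2", text)
    return replace_until_stable(UL_ITEM, r"<ul>\1* \2", text)
```

One limitation to be aware of, as far as I can tell: the first group has to span everything between the list's opening tag and the item being converted, and [\s\w<>/] excludes punctuation. An item such as "Item 1." would therefore stop the items after it from being converted, so the character class may need widening for real input.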
Then you can remove the start and end tags of the lists.
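For that last step, a single pattern can cover both list types. A minimal sketch in Python, assuming each list tag sits on its own line as in the example; strip_list_tags is a name invented for this illustration.

```python
import re

# Matches an opening or closing <ol>/<ul> tag, plus one trailing newline if present.
LIST_TAG = re.compile(r"</?(?:ol|ul)>\n?")


def strip_list_tags(text):
    """Remove the list wrapper tags, leaving the converted item lines."""
    return LIST_TAG.sub("", text)
```

Consuming the optional newline along with the tag avoids leaving a blank line where each tag used to be.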
PS: The replacement syntax ($1) might vary depending on the tool you use; some tools write \1 instead.
Upvotes: 1