Reputation: 185
I need to process some HTML content and replace each IMG SRC value with the actual data. For this I have chosen regular expressions.
My first step is to find the IMG tags. For this I am using the following expression:
<img.*src.*=\s*".*"
Then, within the IMG tag, I look for SRC="..." and replace it with the new SRC value. I am using the following expression to get the SRC:
src\s*=\s*".*"\s*
The second expression has issues.
For the following text it works:
<img alt="3D""" hspace=
"3D0" src="3D"cid:TDCJXACLPNZD.hills.jpg"" align=
"3dbaseline" border="3d0" />
But for the following it does not:
<img alt="3D""" hspace="3D0" src=
"3D"cid:UHYNUEWHVTSH.lilies.jpg"" align="3dbaseline"
border="3d0" />
What happens is that the expression returns
src="3D"cid:TDCJXACLPNZD.hills.jpg"" align=
"3dbaseline"
It does not return only the src part as expected.
I am using C++ Boost regex library.
Please help me to figure out the problem.
Thanks, Hilmi.
Upvotes: 0
Views: 220
Reputation: 14748
Your first regex doesn't work on your sample text for me. I usually use this instead, when looking for specific HTML tags:
<img[^>]*>
Also, try this for your second expression:
src\s*=\s*"[^"]*"\s*
Does that help?
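In case it is useful, here is a minimal sketch of the two-step approach (find each IMG tag, then swap the SRC inside it). It uses std::regex as a stand-in because I cannot test against your Boost setup; the two patterns should carry over to boost::regex, but treat that as an assumption. The function name replace_img_src and the new_src parameter are made up for illustration, and I dropped the trailing \s* so the whitespace after the attribute is preserved.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces the first src="..." attribute inside every
// <img ...> tag with src="<new_src>". Text outside IMG tags is left untouched.
std::string replace_img_src(const std::string& html, const std::string& new_src) {
    // [^>]* keeps the match inside a single tag (and may span line breaks).
    static const std::regex img_tag(R"rx(<img[^>]*>)rx", std::regex::icase);
    // [^"]* stops at the first closing quote instead of running on greedily.
    static const std::regex src_attr(R"rx(src\s*=\s*"[^"]*")rx", std::regex::icase);

    std::string out;
    std::size_t last = 0;  // end of the previously copied region
    for (std::sregex_iterator it(html.begin(), html.end(), img_tag), end;
         it != end; ++it) {
        const std::size_t pos = static_cast<std::size_t>(it->position());
        const std::string tag = it->str();
        out.append(html, last, pos - last);  // copy text before this tag

        std::smatch m;
        if (std::regex_search(tag, m, src_attr)) {
            // Rebuild the tag by hand so new_src is inserted literally
            // (no $-substitution as with a regex_replace format string).
            out += m.prefix().str();
            out += "src=\"" + new_src + "\"";
            out += m.suffix().str();
        } else {
            out += tag;  // IMG tag without a src attribute: keep as-is
        }
        last = pos + tag.size();
    }
    out.append(html, last, std::string::npos);  // copy the remaining tail
    return out;
}
```

Building the result by hand rather than calling regex_replace is a deliberate choice here: it restricts the replacement to text inside IMG tags, so a stray src="..." elsewhere in the document is not touched.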
Upvotes: 0
Reputation: 526613
The problem is that .* is a "greedy" match - it will grab as much text as it possibly can while still allowing the regex to match. What you probably want is something like this:
src\s*=\s*"[^"]*"\s*
which will only match non-doublequote characters inside the src string, and thus not go past the ending doublequote.
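To see the difference for yourself, a small sketch like the following may help. It uses std::regex rather than Boost (an assumption on my part, so that it compiles without extra libraries), and first_match is just a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of `pattern` in `text`,
// or an empty string when nothing matches.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

Run against the single-line tag `<img src="cid:hills.jpg" align="baseline" />`, the greedy pattern `src\s*=\s*".*"` returns `src="cid:hills.jpg" align="baseline"`, because .* runs to the last doublequote on the line. Both `src\s*=\s*"[^"]*"` and the non-greedy `src\s*=\s*".*?"` return just `src="cid:hills.jpg"`. One caveat: as far as I know, Boost's default Perl syntax lets . match newlines as well, while std::regex's default ECMAScript syntax does not, so under Boost the greedy version can run across line breaks too.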
Upvotes: 2