Reputation: 370
I know this has been asked a million times before, so apologies for a repeat question, but this is driving me nuts. I've been working on this for ages now and don't seem to be getting anywhere.
I have some HTML code that contains images floated right or left. What I need to do is find all images that are floated, remove the float, and then wrap each one in a div that is floated the same way the image was.
e.g. from
<img src="images/imagepath1.jpg" border="0" alt="image 1" width="200" height="206" style="float: right;" />
to
<div class="imgContainer" style="float: right;"><img src="images/imagepath1.jpg" border="0" alt="image 1" width="200" height="206" /></div>
I am using this regex in Notepad++. Find:
<img src="(.+)" border="([0-9]{1})" alt="(.*?)" width="([0-9]{2,3})" height="([0-9]{3})" style="float: (right|left);" />
Replace with
<div class="imgContainer" style="float: \6;"><img src="\1" border="\2" alt="\3" width="\4" height="\5" /></div>
The problem is that in a block of code containing <p> tags and multiple images, a single match covers the whole code block from beginning to end.
E.g.
<img src="images/imagepath1.gif" border="0" alt="image 1" width="207" height="119" style="float: right;" /><p>Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum</p><p>Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum </p><p>Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum</p>
<img src="images/imagepath2.jpg" border="0" alt="image2" width="96" height="141" style="float: left;" /><p>Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum </p><p>Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum </p><img src="images/imagepath3.gif" border="0" alt="image 3" width="72" height="108" style="float: right;" />
In Notepad++ this matches the whole block. Can you offer any suggestions? It's driving me nuts!
Adam
Upvotes: 1
Views: 204
Reputation: 15010
Ensure you're using the latest version of Notepad++; there were known problems with regex in Notepad++ v5 and earlier, which were corrected in v6.
That said, there are a ton of edge cases where regex has difficulty handling HTML, such as:
<img onmouseover=' src="TheseAreNotTheDroidsYouAreLookingFor.png" ; funImageSwap(src); ' src="DecoyDroids.png">
In your expression, consider changing your .+ to [^"]+. This prevents the regex engine from leaving the quoted value or tag and traveling into the next possible match:
<img src="([^"]+)" border="([0-9]{1})" alt="([^"]*?)" width="([0-9]{2,3})" height="([0-9]{3})" style="float: (right|left);" />
But this doesn't handle the other edge cases.
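As a rough way to see the difference outside the editor, here is a minimal sketch in Python. It assumes Python's `re` module behaves like Notepad++'s engine for these particular constructs (Notepad++ v6 uses a different regex library, so treat this as an illustration, not a guarantee). The sample HTML is shortened from the question, and the names `TAIL`, `greedy` and `bounded` are made up for the example.

```python
import re

# Two floated images on one line, shortened from the question's sample.
html = (
    '<img src="images/imagepath1.gif" border="0" alt="image 1" width="207" '
    'height="119" style="float: right;" /><p>Lorem ipsum</p>'
    '<img src="images/imagepath2.jpg" border="0" alt="image2" width="96" '
    'height="141" style="float: left;" />'
)

# Everything after the src group; shared so that only the src group differs.
TAIL = (r'" border="([0-9]{1})" alt="([^"]*?)" width="([0-9]{2,3})" '
        r'height="([0-9]{3})" style="float: (right|left);" />')

greedy = re.compile(r'<img src="(.+)' + TAIL)      # .+ can run past the closing quote
bounded = re.compile(r'<img src="([^"]+)' + TAIL)  # [^"]+ stops at the closing quote

greedy_matches = greedy.findall(html)
bounded_matches = bounded.findall(html)

print(len(greedy_matches), "match with .+")
print(len(bounded_matches), "matches with [^\"]+")

# Python writes backreferences as \1 .. \6, like the question's replacement.
replacement = (r'<div class="imgContainer" style="float: \6;">'
               r'<img src="\1" border="\2" alt="\3" width="\4" height="\5" /></div>')
wrapped = bounded.sub(replacement, html)
print(wrapped)
```

With the greedy version, the src group swallows everything up to the last image, which is the "whole block" match described in the question.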
To bypass those edge cases, you could use this monster expression. I have it on multiple lines and commented here to make it easier to understand; however, in Notepad++ you'll need to remove the comments and all the newlines.
Regex
<img(?=\s|>)
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=('[^']*'|"[^"]*"|[^'"][^\s>]*)) # find src, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sborder=('[^']*'|"[^"]*"|[^'"][^\s>]*)) # find border, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=('[^']*'|"[^"]*"|[^'"][^\s>]*)) # find alt, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\swidth=('[^']*'|"[^"]*"|[^'"][^\s>]*)) # find width, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sheight=('[^']*'|"[^"]*"|[^'"][^\s>]*)) # find height, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sstyle="[^"]*(float:\s*(?:right|left))) # find style, capture only the float declaration (e.g. float: right) from inside its value
[^>]*> # consume the rest of the tag so the whole tag gets replaced
Replace with
<div class="imgContainer" style="$6;"><img src=$1 border=$2 alt=$3 width=$4 height=$5 /></div>
This is the single-line expression as entered in Notepad++. I'm using Notepad++ v6.3.3.
<img(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sborder=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\swidth=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sheight=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sstyle="[^"]*(float:\s*(?:right|left)))[^>]*>
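To try the expression outside Notepad++, here is a small sketch in Python. It rebuilds the same pattern from two repeated pieces and assumes Python's `re` module treats these lookaheads the same way as Notepad++'s engine. The helper `attr_lookahead`, the names `SKIP` and `VALUE`, and both sample tags are made up for illustration; Python writes the backreferences as `\1` rather than `$1`.

```python
import re

# Steps through the tag while jumping over whole attribute values, so a later
# "\sname=" can only match a real attribute name.
SKIP = r"""(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?"""
# Captures an attribute value, including its quotes if it has any.
VALUE = r"""('[^']*'|"[^"]*"|[^'"][^\s>]*)"""


def attr_lookahead(name):
    """Build one lookahead that finds attribute `name` and captures its value."""
    return r"(?=" + SKIP + r"\s" + name + "=" + VALUE + ")"


pattern = re.compile(
    r"<img(?=\s|>)"
    + "".join(attr_lookahead(n) for n in ("src", "border", "alt", "width", "height"))
    + r"(?=" + SKIP + r"""\sstyle="[^"]*(float:\s*(?:right|left)))"""
    + r"[^>]*>"
)

replacement = (r'<div class="imgContainer" style="\6;">'
               r'<img src=\1 border=\2 alt=\3 width=\4 height=\5 /></div>')

# Attributes in a different order than in the question.
shuffled = """<img style="float: left;" alt='image 2' src="images/imagepath2.jpg" height="141" width="96" border="0" />"""
# A decoy src hidden inside another attribute's value.
decoy = """<img onmouseover=' src="decoy.png" ' src="real.png" border="0" alt="a" width="10" height="10" style="float: right;">"""

shuffled_out = pattern.sub(replacement, shuffled)
decoy_out = pattern.sub(replacement, decoy)
print(shuffled_out)
print(decoy_out)
```

Because each lookahead starts again from just after `<img`, the attribute order in the input does not matter, and the output always lists the attributes in the order given in the replacement.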
<img
  match the start of the image tag
(?=\s|>)
  look ahead to ensure the tag name is followed by whitespace or a close angle bracket
(?=
  look ahead; this particular one finds the src attribute, but the idea is the same for all the others. The lookahead allows the attributes to appear in any order inside the tag, because after the lookahead is satisfied the regex engine returns to where the lookahead started and continues with the rest of the expression.
(?:
  non-capture group that moves the regex cursor through the string, skipping over all the quoted attribute values. This is the magic that bypasses attribute values which could be mistaken for a desirable attribute name.
[^>=]
  match any character which is not a close angle bracket or an equal sign
|
  or
='[^']*'
  match an equal sign followed by a single quote, all text inside the single quotes, and the closing single quote
|
  or
="[^"]*"
  match an equal sign followed by a double quote, all text inside the double quotes, and the closing double quote
|
  or
=[^'"][^\s>]*
  match an equal sign followed by a non-quote character, which is followed by any number of characters which are not whitespace or close angle brackets
)*?
  close the non-capture group and allow it to repeat as many times as necessary. The matching will not leave the tag, so if the next condition is not met then this particular tag is not the tag we are looking for.
\ssrc=
  match a whitespace character followed by src=. Thanks to the above non-capture group this can only be an attribute name.
(
  start the capture group; this will get the value of the src attribute
'[^']*'
  match a single quote, all text inside the single quotes, and the closing single quote
|
  or
"[^"]*"
  match a double quote, all text inside the double quotes, and the closing double quote
|
  or
[^'"][^\s>]*
  match a non-quote character followed by any number of characters which are not whitespace or close angle brackets
)
  close the capture group
)
  close the lookahead
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sborder=('[^']*'|"[^"]*"|[^'"][^\s>]*))
  find border, capture the value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=('[^']*'|"[^"]*"|[^'"][^\s>]*))
  find alt, capture the value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\swidth=('[^']*'|"[^"]*"|[^'"][^\s>]*))
  find width, capture the value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sheight=('[^']*'|"[^"]*"|[^'"][^\s>]*))
  find height, capture the value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sstyle="[^"]*(float:\s*(?:right|left)))
  find style and capture the float declaration; this one is slightly different because only part of the attribute value is matched
[^>]*>
  match the rest of the img tag and the close angle bracket. This prevents the regex engine from accidentally finding an included attribute whose value could be mistaken for another img tag.
Upvotes: 1
Reputation: 71578
I would say that you're on the right path, and the find/replace you've come up with is only one character away from working.
This is your current find:
<img src="(.+)" border="([0-9]{1})" alt="(.*?)" width="([0-9]{2,3})" height="([0-9]{3})" style="float: (right|left);" />
Change it to:
             v
<img src="(.+?)" border="([0-9]{1})" alt="(.*?)" width="([0-9]{2,3})" height="([0-9]{3})" style="float: (right|left);" />
The v shows where I introduced the one character you are currently missing. Once you make this .+ lazy, you should get the correct replacements and not a single replacement for the whole thing.
That said, I too would advise using [^"] instead of . in such cases.
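To illustrate why the negated class is the safer choice, here is a minimal Python sketch. It assumes Python's `re` module handles lazy quantifiers like Notepad++ does; the `build` helper and the `mixed` sample, where a non-floated image comes before a floated one, are made up for the example.

```python
import re


def build(src, alt):
    """Build the find pattern with the given sub-patterns for src and alt."""
    return re.compile(
        '<img src="(' + src + ')" border="([0-9]{1})" alt="(' + alt + ')" '
        'width="([0-9]{2,3})" height="([0-9]{3})" style="float: (right|left);" />'
    )


lazy = build('.+?', '.*?')          # the one-character fix
bounded = build('[^"]+', '[^"]*')   # the negated character class

# Two floated images, shortened from the question: the lazy version is enough here.
two_floats = (
    '<img src="images/imagepath1.gif" border="0" alt="image 1" width="207" '
    'height="119" style="float: right;" /><p>Lorem ipsum</p>'
    '<img src="images/imagepath2.jpg" border="0" alt="image2" width="96" '
    'height="141" style="float: left;" />'
)
print(len(lazy.findall(two_floats)), "matches with the lazy pattern")

# A non-floated image first: the lazy groups may still stretch across tags.
mixed = (
    '<img src="images/plain.jpg" border="0" alt="plain" width="100" height="100" />'
    '<p>Lorem ipsum</p>'
    '<img src="images/imagepath2.jpg" border="0" alt="image2" width="96" '
    'height="141" style="float: left;" />'
)
lazy_match = lazy.search(mixed)
bounded_match = bounded.search(mixed)
print("lazy match starts at", lazy_match.start())
print("bounded match starts at", bounded_match.start())
```

In the `mixed` sample the lazy match begins at the non-floated image and runs through the floated one, while the `[^"]` version matches only the floated tag.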
Upvotes: 1