Finding first occurrence of a pattern in regex

Question

I know this has been asked a million times before, so apologies for a repeat question, but this is driving me nuts. I've been working on this for ages now and don't seem to be getting anywhere.

I have some HTML that contains images floated right or left. What I need to do is find all images that are floated, remove the float, and then wrap each one in a div that is floated the same way the image was.

e.g. from

to

I am using this code in Notepad++ Find

Replace with

The problem is that in a block of code containing

tags and multiple images I highlight the whole code block from beginning to end.

E.g.

Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum

Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum

Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum


Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum 
Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum

In Notepad++ this matches the whole block. Can you offer any suggestions? It's driving me nuts!

Adam

Ro Yo Mi · Accepted Answer

Foreword

Ensure you're using the latest version of Notepad++. There were known problems using regex in Notepad++ v5 and earlier, which have been corrected in v6.

Basic

There are a ton of edge cases where regex has difficulty handling HTML, such as:

attributes can appear in any order within the tag
values of attributes can look like actual attributes such as
attribute values can use single, double, or no quotes

In your expression, consider changing your .+ to [^"]+. This will prevent the regex engine from leaving the quoted area or tag and traveling into the next possible match.
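To illustrate the difference, here is a minimal Python sketch. The pattern `style="(.+)"` and the sample HTML are hypothetical stand-ins (the original Find expression is not shown here), and Python's `re` module is used instead of Notepad++, but greedy `.+` behaves the same way in both.

```python
import re

# Hypothetical sample: two floated images on one line.
html = ('<img style="float: right" src="a.png"> text '
        '<img style="float: left" src="b.png">')

# Greedy .+ grabs as much as it can, so it runs on to the LAST
# double quote on the line and swallows the second image too.
greedy = re.search(r'style="(.+)"', html)

# [^"]+ cannot cross a double quote, so it stops at the end of
# the first attribute value.
bounded = re.search(r'style="([^"]+)"', html)

print(greedy.group(1))
print(bounded.group(1))
```

The greedy version captures everything up to `b.png`, while the negated character class captures only the first style value.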

But this doesn't handle the other edge cases.

Complex

To bypass those edge cases, you could use this monster expression. I have split it across multiple lines and commented it here to make it easier to understand; in Notepad++ you'll need to remove the comments and all the newlines.

Regex

<img(?=\s|>)
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=('[^']*'|"[^"]*"|[^'"][^\s>]*)) # find src, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sborder=('[^']*'|"[^"]*"|[^'"][^\s>]*))  # find border, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=('[^']*'|"[^"]*"|[^'"][^\s>]*)) # find alt, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\swidth=('[^']*'|"[^"]*"|[^'"][^\s>]*))   # find width, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sheight=('[^']*'|"[^"]*"|[^'"][^\s>]*))  # find height, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sstyle="[^"]*(float:\s*(?:right|left)))  # find style, capture the float declaration (float: right or float: left)
[^>]*>                      # match the rest of the tag through the closing bracket

Replace with

This is the single-line expression inserted into my Notepad++ example. I'm using Notepad++ v6.3.3.

<img(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sborder=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\swidth=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sheight=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sstyle="[^"]*(float:\s*(?:right|left)))[^>]*>
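For anyone who wants to try the expression outside Notepad++, here is a rough Python sketch. It makes several assumptions: the opening `<img(?=\s|>)` is inferred from the explanation of the image tag match, the sample tags are made up, and the `wrap` function is only one hypothetical way to build the floated div described in the question, not the replacement string used in this answer. The sketch rebuilds the tag from the six captured values, so any other attributes or style rules would be dropped.

```python
import re

# Building blocks copied from the expression above.
SKIP = r'''(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?'''
VALUE = r'''('[^']*'|"[^"]*"|[^'"][^\s>]*)'''


def attr(name):
    """Lookahead that finds one attribute and captures its value."""
    return rf'(?={SKIP}\s{name}={VALUE})'


IMG_RE = re.compile(
    r'<img(?=\s|>)'  # assumed opening of the expression
    + ''.join(attr(n) for n in ('src', 'border', 'alt', 'width', 'height'))
    + rf'(?={SKIP}\sstyle="[^"]*(float:\s*(?:right|left)))'
    + r'[^>]*>'
)


def wrap(match):
    """Hypothetical replacement: move the float onto a wrapping div."""
    src, border, alt, width, height, float_decl = match.groups()
    img = (f'<img src={src} border={border} alt={alt} '
           f'width={width} height={height}>')
    return f'<div style="{float_decl}">{img}</div>'


# Made-up sample tags: one floated, one not.
floated = ('<img src="a.png" border="0" alt="A" width="10" '
           'height="20" style="float: right;">')
plain = ('<img src="c.png" border="0" alt="C" width="1" '
         'height="1" style="margin: 0">')

result = IMG_RE.sub(wrap, floated + ' text ' + plain)
print(result)
```

Only the floated image is rewritten; the tag without a float in its style attribute fails the last lookahead and is left untouched.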


Expanded

<img match the image tag


(?=\s|>) look ahead to ensure the image tag name is followed by a space or close angle bracket
(?= look ahead. This particular one finds the src attribute, but the idea is the same for all the others. The lookahead allows the attributes to appear in any order inside the tag, because after the lookahead is satisfied the regex engine returns to where the lookahead started and continues with the rest of the expression.


(?: non capture group moves the regex cursor through the string, skipping over all the quoted attribute values. This is the magic that bypasses the attribute values which could be mistaken as a desirable attribute name.
[^>=] match all characters which are not close brackets or equal signs
| or
='[^']*' match an equal sign followed by single quotes, all text inside the single quotes and close single quote
| or
="[^"]*" match an equal sign followed by double quotes, all text inside the double quotes and close double quote
| or
=[^'"][^\s>]* an equal sign followed by a non quote character which is followed by any number of characters which are not spaces or close angle brackets
)*? close the non capture group, and allow it to repeat as many times as necessary. The match will not leave the tag, so if the next condition is not met then this particular tag is not the tag we are looking for

\ssrc= match a space followed by src=. Thanks to the above non-capture group this can only be an attribute name
( start capture group this will get the value of the src attribute


'[^']*' match a single quote, all text inside the single quotes, and the closing single quote
| or
"[^"]*" match a double quote, all text inside the double quotes, and the closing double quote
| or
[^'"][^\s>]* match a non-quote character followed by any number of characters which are not whitespace or close angle brackets
) close the capture group

) close the lookahead
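As a rough check of the skipping logic, the Python sketch below compares a naive search with the src lookahead on its own. The sample tag, the naive pattern, and the `<img(?=\s|>)` opening are assumptions made for illustration; only the skip group and the value group are copied from the expression.

```python
import re

# Skip group and value group copied from the expression above.
SKIP = r'''(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?'''
VALUE = r'''('[^']*'|"[^"]*"|[^'"][^\s>]*)'''

# Just the src lookahead, followed by the rest of the tag.
SRC_RE = re.compile(rf'<img(?=\s|>)(?={SKIP}\ssrc={VALUE})[^>]*>')

# Made-up tag whose alt text contains something that looks like src=...
tag = '<img alt="see src=fake.png here" src="real.png">'

# A naive search is fooled by the text inside the alt value.
naive = re.search(r'\ssrc=(\S+)', tag)

# The skip group steps over the whole quoted alt value in one move,
# so the only src= it can reach is the real attribute.
careful = SRC_RE.search(tag)

print(naive.group(1))
print(careful.group(1))
```

The naive pattern reports `fake.png`, while the lookahead captures the real value, quotes included.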
These next lookaheads all follow the same logic as the src lookahead above


(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sborder=('[^']*'|"[^"]*"|[^'"][^\s>]*)) find border, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=('[^']*'|"[^"]*"|[^'"][^\s>]*)) find alt, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\swidth=('[^']*'|"[^"]*"|[^'"][^\s>]*)) find width, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sheight=('[^']*'|"[^"]*"|[^'"][^\s>]*)) find height, capture value including quotes if they exist
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sstyle="[^"]*(float:\s*(?:right|left))) find style, capture the float declaration; this one is slightly different because of how the actual attribute value is matched

[^>]*> match the rest of the img tag and close bracket, this prevents the regex engine from accidentally finding an included attribute which may have a value which could be mistaken as another img tag.
