Match image tags with regex

I am having some trouble with this regex:

<img(.+)src="_image/([0-9]*)/(.+)/>

The global and case-insensitive flags are on.

The problem is that it also grabs the "Image n" text (see the string below), but I want it to match only the image tags in the string.

<p>Image 1:<img width="199" src="_image/12/label" alt=""/> Image 2: <img width="199" src="_image/12/label" alt=""/><img width="199" src="_image/12/label" alt=""/></p>

It works if I put a newline before Image n :)

Can anyone point out what I am doing wrong?

Thanks in advance, bob

Upvotes: 1

Answers (5)

BenAlabaster

Reputation: 39864

If I interpret your regex correctly, it looks like you're after the directory name in the first group and the file path in the second group?

<IMG.*?SRC="_image/(\d+?)/([^"]*?)".*?/>

Don't forget to turn on the case-insensitive regex option, which has the same effect as wrapping the regex in (?i:[regex])

In the second group, match everything that is not the closing ". Right now you're matching any character at all, but you don't need that; you only want the characters up to the closing quote of the string.

Also, don't forget the closing quote of the SRC string, which your regex is missing, and remember that the SRC attribute may not be the last one in the tag: border, width, height, etc. can follow it. There may also be any number of spaces between the last attribute and the end of the tag />
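To see those points together, here is a minimal sketch in Python. The question does not say which engine is in use, so Python's re module is an assumption. The pattern is written with _image/ and no leading slash so that it matches the sample string from the question, and re.IGNORECASE stands in for the case-insensitive option.

```python
import re

# Sample string from the question.
html = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)

# Lazy quantifiers plus [^"] keep each match inside a single tag.
pattern = re.compile(r'<IMG.*?SRC="_image/(\d+?)/([^"]*?)".*?/>', re.IGNORECASE)

tags = [m.group(0) for m in pattern.finditer(html)]
groups = pattern.findall(html)  # one (directory, path) tuple per tag

for directory, path in groups:
    print(directory, path)
```

Each of the three tags is matched separately, and none of the matches includes the "Image n" text.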

From this regex, your first match group will hold the subdirectory name and the second match group will hold everything after the / of the subdirectory - including nested subdirectories. If you've got nested subdirectories, you may need to expand this slightly:

<IMG.*?SRC="_image/((\d+?)/)+?([^"]*?)".*?/>

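In this case, the intent is that the leading groups hold the nested directory names and the last group holds the file name. Be aware that most engines keep only one repetition of a repeated group, and that the lazy +? repeats as few times as it can, so check what the groups actually contain in your engine.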
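As a rough check of how the groups come out, here is a sketch in Python. The engine is an assumption, and the sample tag with two nested directories is made up for illustration. In Python's re, the lazy +? stops after the first directory and the rest of the path lands in the last group, so capturing the whole path once and splitting it on / may be the simpler route.

```python
import re

# Hypothetical tag with two nested directories (not from the question).
tag = '<img width="199" src="_image/12/34/photo.png" alt=""/>'

# The nested-directory pattern, without a leading slash before _image.
nested = re.compile(r'<IMG.*?SRC="_image/((\d+?)/)+?([^"]*?)".*?/>', re.IGNORECASE)
m = nested.search(tag)
print(m.groups())  # the lazy +? repeats only once here

# Alternative sketch: capture the whole path once, then split it.
whole = re.compile(r'<img[^>]*?src="_image/([^"]*)"', re.IGNORECASE)
*directories, filename = whole.search(tag).group(1).split("/")
print(directories, filename)
```

The split version does not depend on how a particular engine stores repeated groups.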

Upvotes: 1

Axeman

Reputation: 29854

You're using a greedy quantifier (+) without much restriction. A greedy quantifier tells the regex engine: "Grab every character that qualifies and only back off enough to complete the regex." That means the match starts at the first <img and does not stop at the first "_image/nnnnnn/something/>"; it runs on to the last one it can find, taking everything in between with it.
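The difference is easy to see by running the regex from the question next to a lazy variant. This is a sketch in Python, which is an assumption since the question does not name the engine. The lazy pattern only swaps each + for +? and is not taken from any of the answers.

```python
import re

# Sample string from the question.
html = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)

# Regex from the question: both (.+) groups are greedy.
greedy = re.compile(r'<img(.+)src="_image/([0-9]*)/(.+)/>', re.IGNORECASE)
# Same regex with lazy quantifiers, so each group stops as early as it can.
lazy = re.compile(r'<img(.+?)src="_image/([0-9]*)/(.+?)/>', re.IGNORECASE)

greedy_matches = [m.group(0) for m in greedy.finditer(html)]
lazy_matches = [m.group(0) for m in lazy.finditer(html)]

print(len(greedy_matches))  # one long match spanning all three tags
print(len(lazy_matches))    # one match per tag
```

The single greedy match contains the "Image 2:" text, while the lazy matches each cover exactly one tag.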

Upvotes: 0