bob

Reputation:

Match image tags with regex

I am having some trouble with this regex:

<img(.+)src="_image/([0-9]*)/(.+)/> 

The global and case-insensitive flags are on.

The problem is that it also grabs the "Image n" text (see the string below), but I want it to match only the image tags in the string.

<p>Image 1:<img width="199" src="_image/12/label" alt=""/> Image 2: <img width="199" src="_image/12/label" alt=""/><img width="199" src="_image/12/label" alt=""/></p>

It works if I put a newline before Image n :)

Can anyone point out what I am doing wrong?

Thanks in advance, bob

Upvotes: 1

Views: 2551

Answers (5)

BenAlabaster

Reputation: 39864

If I interpret your regex correctly, it looks like you're after the directory name in the first group and the file path in the second group?

<IMG.*?SRC="_image/(\d+?)/([^"]*?)".*?/>

Don't forget the CaseInsensitive regex option, which has the same effect as wrapping the regex in (?i:[regex]).

In the second group, match everything that is not the closing ". Right now you're matching any character at all, which you don't need: you only want everything up to the closing quote of the string.

Also, don't forget to close your SRC string, which your regex is missing, and remember that the SRC attribute may not be the last one in the tag: border, width, height, etc. may follow it. There may also be any number of spaces between the end of the last attribute and the closing />.

From this regex, your first match group will hold the subdirectory name and the second match group will hold everything after the / of the subdirectory - including nested subdirectories. If you've got nested subdirectories, you may need to expand this slightly:

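<IMG.*?SRC="_image/((\d+?)/)+([^"]*?)".*?/>

In this case, the repeated leading group matches each of the nested directory names, and the last group will hold the file name. The quantifier on that group needs to be greedy (+ rather than +?), otherwise the nested directories end up in the last group. Also note that most regex engines keep only the last repetition of a repeated group; .NET exposes every repetition through the group's Captures collection.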
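As a rough check of the first regex above, here is a small Python sketch run against the string from the question. Python and the variable names are my own choices for illustration, and the pattern is written without a leading slash before _image, since the sample src values start with _image/.

```python
import re

# The sample string from the question.
html = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)

# Lazy quantifiers plus a negated character class for the src value.
pattern = r'<IMG.*?SRC="_image/(\d+?)/([^"]*?)".*?/>'

# Case-insensitive matching via a flag ...
with_flag = re.findall(pattern, html, re.IGNORECASE)

# ... or via a scoped inline flag (supported in Python 3.6+).
with_inline = re.findall('(?i:' + pattern + ')', html)

for directory, path in with_flag:
    print(directory, path)  # prints "12 label" once per image tag
```

Because the second group can only consume characters that are not a double quote, it stops at the end of the src value instead of running on into the alt attribute.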

Upvotes: 1

Axeman

Reputation: 29854

You're using a greedy quantifier (+) without much restriction. A greedy quantifier tells the regex engine: "Grab every character that qualifies and only back off enough to complete the regex." That means a single match runs from the first <img to the last "_image/nnnnnn/something/>" on the same line, swallowing everything in between.
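To make the greedy behaviour concrete, here is a minimal Python sketch using the pattern and string from the question. Python and the variable names are assumptions made for illustration only.

```python
import re

# The sample string from the question.
html = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)

# The pattern from the question, with greedy (.+) groups.
greedy = re.compile(r'<img(.+)src="_image/([0-9]*)/(.+)/>', re.IGNORECASE)

# One match: it starts at the first <img and ends at the last />.
matches = [m.group(0) for m in greedy.finditer(html)]

# With a newline before "Image 2:" the dot can no longer cross the line
# break, so the stray text is not captured, but the two adjacent tags on
# the second line are still merged into one match.
with_newline = html.replace(' Image 2: ', '\nImage 2: ')
newline_matches = [m.group(0) for m in greedy.finditer(with_newline)]

print(len(matches), len(newline_matches))
```

This is consistent with the observation in the question that adding a newline appears to fix things: the line break only limits how far the greedy groups can reach.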

Upvotes: 0

Chas. Owens

Reputation: 64939

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
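As one possible illustration of the parser route, here is a short sketch using Python's standard-library html.parser module. The choice of parser, the class name ImgSrcCollector, and the "_image/" prefix filter are assumptions for this example, not something taken from the linked questions.

```python
from html.parser import HTMLParser

# The sample string from the question.
html = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects src values of <img> tags under _image/."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />, because the
        # default handle_startendtag() delegates to handle_starttag().
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and src.startswith("_image/"):
            self.sources.append(src)


collector = ImgSrcCollector()
collector.feed(html)
collector.close()

for src in collector.sources:
    # Split "_image/12/label" into the directory and the remaining path.
    _, directory, path = src.split("/", 2)
    print(directory, path)
```

The parser handles attribute order, quoting, and case for you, so the only string handling left is splitting the src value itself.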

Upvotes: 0

Bill Dueber

Reputation:

Use a non-greedy regexp:

<img .*? src="_image/(\d+)/(.+?)/.*?>
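A quick way to try the non-greedy idea is to take the pattern from the question and add ? after each + quantifier. The Python sketch below does that; Python and the variable names are assumptions for illustration.

```python
import re

# The sample string from the question.
html = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)

# The question's pattern with lazy quantifiers: (.+?) instead of (.+).
lazy = re.compile(r'<img(.+?)src="_image/([0-9]*)/(.+?)/>', re.IGNORECASE)

tags = [m.group(0) for m in lazy.finditer(html)]
groups = [m.groups() for m in lazy.finditer(html)]

for tag in tags:
    print(tag)  # each image tag is now matched separately
```

Each tag is matched on its own, though the last group still runs up to the closing /> and so includes the trailing alt attribute text; restricting that group to non-quote characters would tighten it further.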

Upvotes: 1

dirkgently

Reputation: 111288

Have you tried lazy (non-greedy) quantifiers? That worked some time back when I tried something similar.

Upvotes: 0
