Reputation:
I am having some trouble with this regex:
<img(.+)src="_image/([0-9]*)/(.+)/>
The global and case-insensitive flags are on.
The problem is that it also grabs the "Image n" text between the tags (see string below), but I want it to match only the image tags in the string.
<p>Image 1:<img width="199" src="_image/12/label" alt=""/> Image 2: <img width="199" src="_image/12/label" alt=""/><img width="199" src="_image/12/label" alt=""/></p>
It works if I put a newline before Image n :)
Can anyone point out what I am doing wrong?
Thanks in advance, bob
Upvotes: 1
Views: 2551
Reputation: 39864
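If I interpret your regex correctly, you're after the directory name in the first group and the file path in the second group? Something like this (the /? lets it match src values with or without a leading slash, since your sample paths start with _image/):
<IMG.*?SRC="/?_image/(\d+?)/([^"]*?)".*?/>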
Don't forget to set the CaseInsensitive regex option, which is equivalent to wrapping the regex in (?i:[regex]).
In the second group, match everything that is not the closing ". Right now you're matching any character, but you don't need to: you want everything up to the closing quote of the attribute value.
Also, don't forget to close your SRC string, which your regex is missing. Keep in mind that the SRC attribute may not be the last one in the tag (border, width, height and so on may follow it), and that there may be any number of spaces between the last attribute and the closing />.
From this regex, your first match group will hold the subdirectory name and the second match group will hold everything after the / of the subdirectory - including nested subdirectories. If you've got nested subdirectories, you may need to expand this slightly:
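<IMG.*?SRC="/?_image/((\d+?)/)+?([^"]*?)".*?/>
In this case the intent is for the leading groups to hold the nested directory names and for the last group to hold the file name. Be aware of two catches, though: most engines keep only the last iteration of a repeated group, and because [^"] also matches digits and slashes, the lazy +? stops after the first directory and the last group swallows the rest of the path. If you need every directory name separately, it is usually simpler to capture the whole path and split it on /.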
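As a rough illustration of the points above, here is a minimal Python sketch. Python is only an assumption (the question does not say which engine is in use), the leading slash is made optional so the pattern matches the sample string, and the nested path `_image/12/34/label` is a made-up example.

```python
import re

# The first pattern from this answer, with the leading slash made optional
# (an assumption, so that it matches the question's sample paths).
PATTERN = re.compile(r'<IMG.*?SRC="/?_image/(\d+?)/([^"]*?)".*?/>', re.IGNORECASE)

sample = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)

# One (directory, rest_of_path) tuple per <img> tag.
matches = PATTERN.findall(sample)

# Hypothetical nested path: capture the whole remainder, then split on "/".
nested = '<img src="_image/12/34/label" alt=""/>'
directory, rest = PATTERN.search(nested).groups()
parts = [directory] + rest.split("/")

print(matches)
print(parts)
```

Splitting the captured path avoids relying on how a particular engine reports repeated groups.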
Upvotes: 1
Reputation: 29854
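You're using a greedy quantifier (+) without much restriction. A greedy quantifier tells the regex engine: "Grab every character that qualifies and only back off enough to complete the regex." That means the match runs from the first <img to the last "_image/nnnnnn/something/>" on the line, swallowing everything in between. Since . does not match a newline by default, a line break stops the run, which is why adding a newline seems to help.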
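To make the greedy behaviour concrete, here is a small Python sketch that runs the question's original pattern against the sample string. Python is an assumption here, since the question does not name the engine; other engines with greedy quantifiers should behave the same way.

```python
import re

# The pattern exactly as posted in the question, case-insensitive.
greedy = re.compile(r'<img(.+)src="_image/([0-9]*)/(.+)/>', re.IGNORECASE)

sample = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)

greedy_matches = [m.group(0) for m in greedy.finditer(sample)]

# A single match that spans all three tags and the "Image 2:" text.
print(len(greedy_matches))
print(greedy_matches[0])
```

The single match starts at the first `<img` and ends at the last `/>`, which is why the text between the tags ends up inside it.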
Upvotes: 0
Reputation: 64939
Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
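For a sense of what the parser route might look like, here is a minimal sketch using Python's standard-library `html.parser`. The choice of Python and the class name `ImgSrcCollector` are assumptions for illustration; any real HTML parser in your own language would do the same job.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src of every <img> whose path starts with _image/."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src.startswith("_image/"):
                self.sources.append(src)


sample = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)
```

The parser handles attribute order, quoting and spacing for you, so only the path check is left to your own code.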
Upvotes: 0
Reputation: 111288
Have you tried lazy (non-greedy) matching, for example (.+?) in place of (.+)? That worked some time back when I tried something similar.
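A quick way to check this suggestion is to make both quantifiers in the question's pattern lazy and run it on the sample string. The sketch below assumes Python, since the question does not say which engine is in use.

```python
import re

# The question's pattern with both (.+) groups changed to the lazy (.+?).
lazy = re.compile(r'<img(.+?)src="_image/([0-9]*)/(.+?)/>', re.IGNORECASE)

sample = (
    '<p>Image 1:<img width="199" src="_image/12/label" alt=""/> '
    'Image 2: <img width="199" src="_image/12/label" alt=""/>'
    '<img width="199" src="_image/12/label" alt=""/></p>'
)

lazy_matches = [m.group(0) for m in lazy.finditer(sample)]

# One match per tag; the "Image n" text is no longer included.
for match in lazy_matches:
    print(match)
```

Note that the third group still contains the trailing `" alt=""` part of the tag, so a negated class such as `[^"]*` is the tighter choice if you only want the path.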
Upvotes: 0