Reputation: 80
I am trying to use VBA regular expressions to find images in HTML code. Of the two image tags below, the pattern I have finds only the second one, not the first.
.Pattern = "<img\s*src=""([^""]*)"""
<img width="100%" src="red_blue.jpg">
<img src="img7993xyz71.jpg">
Upvotes: 1
Views: 745
Reputation: 15010
The problem with using a .*? is that if the img tag doesn't have a src attribute, you might match more text than you're interested in, or you might accidentally find the src attribute of a subsequent non-img tag.
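To make that failure mode concrete, here is a minimal VBA sketch. The lazy-wildcard pattern and the input string are assumptions made up for illustration (they are not taken from the question); the img tag has no src attribute, but the input tag after it does.

```vba
Sub LazyWildcardDemo()
    ' Late binding, so no project reference to
    ' "Microsoft VBScript Regular Expressions 5.5" is required.
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Hypothetical pattern: <img, then anything (lazily), then src="..."
    re.Pattern = "<img.*?src=""([^""]*)"""

    ' Hypothetical input: the img tag has no src; the input tag does.
    Dim html As String
    html = "<img alt=""x""><input type=""image"" src=""submit.gif"">"

    Dim m As Object
    For Each m In re.Execute(html)
        ' The lazy wildcard runs past the end of the img tag,
        ' so the src of the input tag is captured instead.
        Debug.Print m.SubMatches(0)   ' prints submit.gif
    Next m
End Sub
```

The match starts at the img tag, but the captured value belongs to a different element, which is the kind of false positive the stricter regex is meant to avoid.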
This regex matches the entire img tag and captures the src attribute value. If the img tag doesn't have a src attribute, the tag is skipped.
Regex: <img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?>
Sample Text
Note the second line has some difficult edge cases
<img width="100%" src="red_blue.jpg">
<img onmouseover=' var src="NotRealImage.png" ; funImageSwap(src); '><form><input type="image" src="submit.gif"></form>
<img src="img7993xyz71.jpg">
Code
I realize this example is VB.NET and not VBA; I'm only including it to show that the solution works with the .NET regex engine.
VB.NET Code Example:
Imports System.Text.RegularExpressions
Module Module1
    Sub Main()
        Dim sourcestring As String = "replace with your source string"
        ' IgnorePatternWhitespace lets the pattern span lines and carry # comments
        Dim re As Regex = New Regex("<img\b(?=\s) # capture the open tag
(?=(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*?\ssrc=('[^']*'|""[^""]*""|[^'""][^\s>]*)) # get the src attribute
(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""\s]*)*""\s?> # get the entire tag
", RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline Or RegexOptions.Singleline)
        Dim mc As MatchCollection = re.Matches(sourcestring)
        Dim mIdx As Integer = 0
        For Each m As Match In mc
            ' Group 0 is the whole tag; group 1 is the src value, quotes included
            For groupIdx As Integer = 0 To m.Groups.Count - 1
                Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames(groupIdx), m.Groups(groupIdx).Value)
            Next
            mIdx = mIdx + 1
        Next
    End Sub
End Module
Matches
[0][0] = <img width="100%" src="red_blue.jpg">
[0][1] = "red_blue.jpg"
[1][0] = <img src="img7993xyz71.jpg">
[1][1] = "img7993xyz71.jpg"
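Since the question asks about VBA, here is a rough, untested port of the same pattern to the VBScript.RegExp object. It is a sketch under assumptions: the sub name and the sample string are made up for illustration, and it assumes the VBScript engine treats the lookahead and the capture group inside it the same way the .NET engine does. The VBScript engine has no free-spacing mode, so the pattern is written on one logical line without the # comments.

```vba
Sub FindImgSrc()
    ' Late binding, so no project reference is required.
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True        ' find every img tag, not just the first
    re.IgnoreCase = True
    ' Same pattern as above; double quotes are doubled for the VBA string literal.
    re.Pattern = "<img\b(?=\s)" & _
        "(?=(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*?\ssrc=('[^']*'|""[^""]*""|[^'""][^\s>]*))" & _
        "(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""\s]*)*""\s?>"

    ' Hypothetical input built from the two tags in the question.
    Dim html As String
    html = "<img width=""100%"" src=""red_blue.jpg"">" & vbCrLf & _
           "<img src=""img7993xyz71.jpg"">"

    Dim m As Object
    For Each m In re.Execute(html)
        ' m.Value is the whole tag; SubMatches(0) is the src value,
        ' still wrapped in its quotes as in the Matches listing above.
        Debug.Print m.Value, m.SubMatches(0)
    Next m
End Sub
```

If the surrounding quotes are not wanted, they can be stripped from SubMatches(0) afterwards, for example with Mid$ once the value is known to be quoted.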
Upvotes: 1