user1218122

Reputation: 80

VBA Regular Expressions

I am trying to use VBA regular expressions to find images in HTML code. Of the two image tags below, the pattern I have finds only the second one, not the first.

.Pattern = "<img\s*src=""([^""]*)"""

<img width="100%" src="red_blue.jpg">
<img src="img7993xyz71.jpg">

Upvotes: 1

Views: 745

Answers (1)

Ro Yo Mi

Reputation: 15010

Description

The problem with using a .*? is that if the img tag doesn't have a src attribute, then you might match more text than you're interested in, or you might accidentally find the src attribute of a subsequent non-img tag.

This regex will capture the entire img tag and pull out the src attribute value. If the img tag doesn't have a src attribute, the img tag is skipped.

Regex: <img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?>


Example

Sample Text

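Note that the second line has some difficult edge cases: the img tag has no src attribute of its own, but the text src= appears inside its onmouseover value and again in the input tag that follows.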

<img width="100%" src="red_blue.jpg">
<img onmouseover=' var src="NotRealImage.png" ; funImageSwap(src); '><form><input type="image" src="submit.gif"></form>
<img src="img7993xyz71.jpg">

Code

I realize this example is VB.NET and not VBA; I'm only including it to show that the solution works with the .NET regex engine.

VB.NET Code Example:
Imports System.Text.RegularExpressions
Module Module1
  Sub Main()
    Dim sourcestring as String = "replace with your source string"
    Dim re As Regex = New Regex("<img\b(?=\s) # capture the open tag
(?=(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*?\ssrc=('[^']*'|""[^""]*""|[^'""][^\s>]*)) # get the src attribute
(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""\s]*)*""\s?> # get the entire tag
",RegexOptions.IgnoreCase OR RegexOptions.IgnorePatternWhitespace OR RegexOptions.Multiline OR RegexOptions.Singleline)
    Dim mc as MatchCollection = re.Matches(sourcestring)
    Dim mIdx as Integer = 0
    For each m as Match in mc
      For groupIdx As Integer = 0 To m.Groups.Count - 1
        Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames(groupIdx), m.Groups(groupIdx).Value)
      Next
      mIdx=mIdx+1
    Next
  End Sub
End Module

Matches

[0][0] = <img width="100%" src="red_blue.jpg">
[0][1] = "red_blue.jpg"
[1][0] = <img src="img7993xyz71.jpg">
[1][1] = "img7993xyz71.jpg"
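
VBA Sketch

For context, the pattern in the question misses the first tag because \s*src only allows whitespace between <img and src, so any attribute placed before src (such as width) prevents a match. Since the question asks about VBA, here is a minimal sketch of how the same expression might be used through the VBScript.RegExp object. It rests on a few assumptions: late binding via CreateObject is acceptable, the sample html string and the procedure name ListImgSrc are hypothetical, and the engine keeps captures made inside a lookahead available in SubMatches. VBScript.RegExp has no free-spacing option like .NET's IgnorePatternWhitespace, so the inline # comments are dropped and the pattern is built by concatenation.

```vba
' Late binding: no project reference to
' "Microsoft VBScript Regular Expressions 5.5" is needed.
Option Explicit

Sub ListImgSrc()
    Dim re As Object
    Dim matches As Object
    Dim m As Object
    Dim html As String

    ' Hypothetical sample input: the two tags from the question.
    html = "<img width=""100%"" src=""red_blue.jpg"">" & vbCrLf & _
           "<img src=""img7993xyz71.jpg"">"

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True        ' find every img tag, not just the first
    re.IgnoreCase = True    ' also accept <IMG SRC=...>

    ' Same expression as above, with double quotes doubled for VBA.
    ' Part 1: open tag. Part 2: lookahead that captures the src value.
    ' Part 3: consume the rest of the tag.
    re.Pattern = "<img\b(?=\s)" & _
        "(?=(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*?\ssrc=('[^']*'|""[^""]*""|[^'""][^\s>]*))" & _
        "(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""\s]*)*""\s?>"

    Set matches = re.Execute(html)
    For Each m In matches
        ' m.Value is the whole tag; SubMatches(0) is capture group 1,
        ' the src value with its surrounding quotes.
        Debug.Print m.Value & " -> " & m.SubMatches(0)
    Next m
End Sub
```

Run it from the VBA editor and check the Immediate window. As in the Matches listing above, the captured value keeps its quotes, so strip the first and last character if you only want the file name.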

Upvotes: 1
