Richard Griffiths

Reputation: 836

Visual Basic .NET Regex match does not work, despite working in a testing tool?

I have a stupidly simple question that I can't answer myself: I have a regex pattern that works in two different online testers, one of which is .NET-based.

Yet here it finds no matches. Can anyone help? The purpose is to filter a lovely page of F# cheats so that it is printable :).

I'm mentoring my youngest brother, who is in week 4 of learning to code. This is his function, and I confess it has stumped me! I'd be very grateful for any help!

Public Function FindCode(input As String)
    Dim pattern As String = "(?m)(<pre>)(.+)(<\/pre>)\B"
    Dim output As New Dictionary(Of Integer, String)
    Dim count As Integer

    For Each match As Match In Regex.Matches(input, pattern)
        output.Add(count, match.Value)
        count += 1
    Next
    Return output.Count
End Function

I don't get exceptions; I just get no matches.

An example would be

Some random markup <pre> and this stuff in the middle is what I'm after </pre> and there </pre> lots of these in one file </pre> which when I use Regexhero <pre> finds all the tags  </pre> 

The idea is then to use the groups to list all the items between the <pre> and </pre> tags.

Thanks for such quick responses!

Upvotes: 2

Views: 2690

Answers (3)

Matt

Reputation: 26999

First, I tried the expression you provided in Expresso and then in LinqPad. Both returned the entire string, which is not what you intended to match. I see two reasons why it does not show the desired result:

  1. The regular expression itself
  2. A problem in the example string (the tags are not paired; each <pre> must be closed by a </pre>)

Besides that, I suggest some improvements to the code:

  1. Change the way you're matching (example below uses Regex options, and allows grouping)
  2. Add tagName as parameter, add parameter to allow inclusion or exclusion of the tags
  3. Return the collection instead of the count value

Take a look at the code below, which works fine (I've added some optional, commented-out .Dump() statements for LinqPad in case you want to print the values for debugging):

Public Function FindCode(input As String, tagName As String, includeTags As Boolean) As Dictionary(Of Integer, String)
    Const grpName As String = "pregroup"
    ' The named group captures the text between the opening and closing tags.
    ' (\s|\w)+ accepts whitespace and word characters only.
    Dim pattern As String = "(<" & tagName & ">)(?<" & grpName & ">(\s|\w)+)(</" & tagName & ">)"
    Dim output As New Dictionary(Of Integer, String)
    Dim count As Integer

    Dim options As RegexOptions = RegexOptions.IgnoreCase _
          Or RegexOptions.IgnorePatternWhitespace _
          Or RegexOptions.Multiline Or RegexOptions.ExplicitCapture
    ' options.Dump("options")
    Dim rx As New Regex(pattern, options)
    For Each m As Match In rx.Matches(input)
        Dim val As String = Nothing
        If includeTags Then
            ' Whole match, tags included.
            val = m.Value
        ElseIf m.Groups(grpName).Success Then
            ' Only the text captured between the tags.
            val = m.Groups(grpName).Value
        End If
        If val IsNot Nothing Then
            ' val.Dump("Found #" & count + 1)
            output.Add(count, val)
            count += 1
        End If
    Next
    Return output
End Function

Regarding the expression:

  • I am using (\s|\w)+ instead of .+ because it matches only whitespace and word characters (letters, digits, underscore), not angle brackets, and hence not the tags
  • Characters that conflict with special characters of the regex syntax can be escaped as \xnn (where nn is the hex code of the character); note that this is not needed here
  • Use a group name to easily access the content of the tags

Regarding the Regex code: I have added the parameter includeTags so you can see the difference (false excludes the tags, true includes them). Note that you should always set the RegexOptions deliberately, as they affect the way the expression is matched.
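As a minimal sketch of the named-group idea, the following standalone console module extracts the text between tags. The module name NamedGroupSketch, the function InnerTexts and the group name body are hypothetical, and wrapping tagName in Regex.Escape is an extra precaution that is not part of the function above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupSketch
    ' Hypothetical helper: returns the text found between <tagName> and </tagName>.
    Function InnerTexts(input As String, tagName As String) As String()
        ' Regex.Escape protects against tag names that contain regex metacharacters.
        Dim tag As String = Regex.Escape(tagName)
        Dim pattern As String = "<" & tag & ">(?<body>(\s|\w|')+)</" & tag & ">"
        Dim matches As MatchCollection = Regex.Matches(input, pattern, RegexOptions.IgnoreCase)

        Dim result(matches.Count - 1) As String
        For i As Integer = 0 To matches.Count - 1
            ' Read the named group instead of counting parentheses.
            result(i) = matches(i).Groups("body").Value
        Next
        Return result
    End Function

    Sub Main()
        For Each s As String In InnerTexts("a <PRE>one</pre> b <pre>two</pre>", "pre")
            Console.WriteLine("'" & s & "'")   ' 'one', then 'two'
        Next
    End Sub
End Module
```

Reading the group by name keeps the calling code stable if more parentheses are added to the pattern later.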

Finally, here's the main code:

Sub Main
    Dim input As String = "Some random markup <pre> and this stuff in the middle is what I'm after </pre> and there <pre> lots of these in one file </pre> which when I use Regexhero <pre> finds all the tags  </pre>"
    Dim result = FindCode(input, "pre", False)
    Dim count As Integer = result.Count()
    Console.WriteLine(String.Format("Found string {0} times.", count))
    Console.WriteLine("Findings:")
    For Each s In result
        Console.WriteLine(String.Format("'{0}'", s.Value))
    Next
End Sub

This will output:

Found string 2 times.
Findings:
' lots of these in one file '
' finds all the tags  '

However, one question remains: why isn't the first <pre>...</pre> matched? Take a look at the substring "I'm after": it contains ', which is not matched because it is neither whitespace nor a word character. You can add it by specifying (\s|\w|') in the regular expression; then all 3 strings are shown.
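To see the effect of the character class in isolation, the sketch below counts matches for three variants against the same input string. CountMatches is a hypothetical helper, and the negated class [^<]+ is an alternative that is not used in the function above: it accepts any character except <, so punctuation does not have to be listed one character at a time.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassSketch
    ' Hypothetical helper: number of matches of pattern in input.
    Function CountMatches(input As String, pattern As String) As Integer
        Return Regex.Matches(input, pattern).Count
    End Function

    Sub Main()
        Dim input As String = "Some random markup <pre> and this stuff in the middle is what I'm after </pre> and there <pre> lots of these in one file </pre> which when I use Regexhero <pre> finds all the tags  </pre>"

        ' Whitespace and word characters only: the apostrophe blocks the first tag pair.
        Console.WriteLine(CountMatches(input, "<pre>(\s|\w)+</pre>"))      ' 2
        ' Apostrophe added to the alternatives.
        Console.WriteLine(CountMatches(input, "<pre>(\s|\w|')+</pre>"))    ' 3
        ' Anything except "<": no need to enumerate punctuation.
        Console.WriteLine(CountMatches(input, "<pre>[^<]+</pre>"))         ' 3
    End Sub
End Module
```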

Upvotes: 1

Teejay

Reputation: 7469

I got the correct output (for the given regex), one match containing:

<pre> and this stuff in the middle is what I'm after </pre> and there </pre> lots of these in one file </pre> which when I use Regexhero <pre> finds all the tags </pre>

Aside from the fact that I suppose you meant <pre> (not </pre>) after "and there"...

You probably want to use (.+?), because + is greedy by default.

Also, it's not clear why you use (?m) and \B (and why \B appears at the end but not at the start).
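For reference, \B matches only at positions that are not word boundaries. A small sketch (the module name and the sample strings are made up for illustration) suggests what that means at the end of this pattern: the match fails whenever </pre> is immediately followed by a letter, digit or underscore.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NonBoundarySketch
    Sub Main()
        Dim pattern As String = "</pre>\B"

        ' ">" followed by a space: both are non-word characters, so \B succeeds.
        Console.WriteLine(Regex.IsMatch("<pre>x</pre> y", pattern))   ' True
        ' ">" followed by a letter: that position is a word boundary, so \B fails.
        Console.WriteLine(Regex.IsMatch("<pre>x</pre>y", pattern))    ' False
        ' ">" at the very end of the input: no word character on either side, so \B succeeds.
        Console.WriteLine(Regex.IsMatch("<pre>x</pre>", pattern))     ' True
    End Sub
End Module
```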

Upvotes: 1

Jon Skeet

Reputation: 1500775

I think the problem is (.+) - which is greedy by default, so it's matching as much as it possibly can - including intermediate </pre> parts.

If you change it to (.+?) you should get multiple entries. Then, to find the text within the <pre> tag, fetch the value of match.Groups(2) (match.Groups[2] in C#). The ? makes the .+ reluctant: it matches as few characters as it can.
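A minimal sketch of the difference, using a short made-up input with correctly paired tags (the module name is hypothetical):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyQuantifierSketch
    Sub Main()
        Dim input As String = "a <pre>one</pre> b <pre>two</pre> c"

        ' Greedy: .+ runs on to the last </pre>, so there is a single match.
        Console.WriteLine(Regex.Matches(input, "(<pre>)(.+)(</pre>)").Count)    ' 1

        ' Reluctant: .+? stops at the first </pre> it can reach, one match per tag pair.
        For Each m As Match In Regex.Matches(input, "(<pre>)(.+?)(</pre>)")
            ' Group 2 is the text between the tags.
            Console.WriteLine(m.Groups(2).Value)    ' one, then two
        Next
    End Sub
End Module
```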

It's also not clear what (?m) is meant to achieve here.
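(?m) is RegexOptions.Multiline, which only changes where ^ and $ match. If the real page has <pre> blocks that span several lines (an assumption; the question shows only a one-line sample), that alone could explain getting no matches, because . does not match a line feed unless (?s), i.e. RegexOptions.Singleline, is set. A minimal sketch with a made-up two-line input:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MultilineVersusSinglelineSketch
    Sub Main()
        ' Assumed input: a <pre> block whose content spans two lines.
        Dim input As String = "<pre>let x = 1" & Environment.NewLine & "let y = 2</pre>"

        ' (?m) only affects ^ and $; . still stops at the line feed.
        Console.WriteLine(Regex.Matches(input, "(?m)(<pre>)(.+?)(</pre>)").Count)   ' 0

        ' (?s) lets . match the line feed as well.
        Console.WriteLine(Regex.Matches(input, "(?s)(<pre>)(.+?)(</pre>)").Count)   ' 1
    End Sub
End Module
```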

(Oh, and of course it's generally a bad idea to parse HTML using regular expressions...)

Upvotes: 3
