MoizNgp

Reputation: 295

Regex to extract HTML between two comments in VB.NET not working

I'm trying to extract a portion of HTML between two comments.

Here is the test code:

Sub Main()

    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"

    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"

    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = start_comment & "some more html text" & end_comment 

    Dim match As Match = Regex.Match(input_text, regex_pattern)


    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If

    Console.ReadLine()

End Sub

The above works.

When I try to load the actual data from disk, the code below fails.

Sub Main()

    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"

    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"

    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = System.IO.File.ReadAllText(test_file).Replace(vbCrLf, "") 

    Dim match As Match = Regex.Match(input_text, regex_pattern)


    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If

    Console.ReadLine()

End Sub

The HTML file contains the start and end comments and a good amount of HTML in-between. Some content in the HTML file is in Arabic.

With thanks and regards.

Upvotes: 0

Views: 587

Answers (2)

Robbie

Reputation: 19500

Try passing RegexOptions.Singleline to Regex.Match(...) like this:

Dim match As Match = Regex.Match(input_text, regex_pattern, RegexOptions.Singleline)

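This makes the dot (.) match newline characters as well. By default, . matches every character except \n, and Replace(vbCrLf, "") only strips CR+LF pairs, so any bare LF line endings left in the file stop .* from reaching the end comment.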
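
As a minimal sketch of the difference this option makes, the program below runs the same pattern with and without RegexOptions.Singleline. The input string is a hypothetical stand-in for the file contents and assumes bare LF line endings. The Regex.Escape calls, the lazy quantifier (.*?) and the capture group are additions for illustration and are not part of the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SinglelineDemo
    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"

        ' Hypothetical input standing in for the file contents. It uses bare LF
        ' line endings, which Replace(vbCrLf, "") would leave in place.
        Dim input_text As String = "<html>" & vbLf & start_comment & vbLf &
                                   "<p>some more html text</p>" & vbLf &
                                   end_comment & vbLf & "</html>"

        ' Regex.Escape protects against metacharacters in the markers.
        ' (.*?) is lazy, so the match stops at the first end comment.
        Dim regex_pattern As String = Regex.Escape(start_comment) & "(.*?)" &
                                      Regex.Escape(end_comment)

        ' Same pattern, first with default options, then with Singleline.
        Dim without_option As Match = Regex.Match(input_text, regex_pattern)
        Dim with_option As Match = Regex.Match(input_text, regex_pattern, RegexOptions.Singleline)

        Console.WriteLine("default options matched: {0}", without_option.Success)
        Console.WriteLine("Singleline matched: {0}", with_option.Success)

        If with_option.Success Then
            ' Groups(1) holds only the HTML between the two comments.
            Console.WriteLine(with_option.Groups(1).Value.Trim())
        End If
    End Sub
End Module
```

With Singleline set, the Replace(vbCrLf, "") call should no longer be needed, since the dot then matches both CR and LF.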

Upvotes: 2

Niet the Dark Absol

Reputation: 324760

I don't know VB.NET, but does . match newlines there, or is there an option you have to set for that? Consider using [\s\S] instead of . so that newlines are included.
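
A short sketch of this suggestion in VB.NET, assuming the same two marker comments as in the question. The sample input is hypothetical and mixes CR+LF and bare LF line endings. The lazy quantifier, the capture group and the Regex.Escape calls are illustrative additions.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassDemo
    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"

        ' Hypothetical input with mixed line endings and some Arabic text.
        Dim input_text As String = "<body>" & vbCrLf & start_comment & vbLf &
                                   "<p>مرحبا</p>" & vbCrLf &
                                   end_comment & vbLf & "</body>"

        ' [\s\S] matches any character, newlines included, so no
        ' RegexOptions value is needed. VB string literals do not treat
        ' the backslash as an escape character, so the pattern is written as-is.
        Dim regex_pattern As String = Regex.Escape(start_comment) & "([\s\S]*?)" &
                                      Regex.Escape(end_comment)

        Dim result As Match = Regex.Match(input_text, regex_pattern)

        If result.Success Then
            ' Groups(1) is the text between the two comments.
            Console.WriteLine("found {0}", result.Groups(1).Value.Trim())
        Else
            Console.WriteLine("not found")
        End If
    End Sub
End Module
```

Either approach should behave the same for this input. The character class keeps the newline handling inside the pattern itself, while RegexOptions.Singleline changes the meaning of every dot in the pattern.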

Upvotes: 0
