Reputation: 295
I'm trying to extract the portion of HTML between two comments.
Here is the test code:
Sub Main()
    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"
    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"
    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = start_comment & "some more html text" & end_comment
    Dim match As Match = Regex.Match(input_text, regex_pattern)
    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If
    Console.ReadLine()
End Sub
The above works.
When I load the actual data from disk, however, the code below fails.
Sub Main()
    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"
    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"
    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = System.IO.File.ReadAllText(test_file).Replace(vbCrLf, "")
    Dim match As Match = Regex.Match(input_text, regex_pattern)
    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If
    Console.ReadLine()
End Sub
The HTML file contains the start and end comments, with a good amount of HTML in between. Some of the content in the HTML file is in Arabic.
With thanks and regards.
Upvotes: 0
Views: 587
Reputation: 19500
Try passing RegexOptions.Singleline into Regex.Match(...), like this:
Dim match As Match = Regex.Match(input_text, regex_pattern, RegexOptions.Singleline)
This makes the dot (.) match newline characters as well; by default it matches every character except \n.
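A minimal sketch of the difference, assuming the file contains line breaks that Replace(vbCrLf, "") does not strip (for example bare line feeds). The sample input string and the module name are made up for illustration and are not taken from the asker's file:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module SinglelineDemo
    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"
        Dim regex_pattern As String = start_comment & ".*" & end_comment

        ' Hypothetical input: bare line feeds (vbLf) sit between the comments,
        ' which Replace(vbCrLf, "") would leave in place.
        Dim input_text As String = start_comment & vbLf & "<p>some html</p>" & vbLf & end_comment

        ' Default options: "." does not match \n, so no match is found.
        Console.WriteLine(Regex.IsMatch(input_text, regex_pattern))

        ' RegexOptions.Singleline: "." also matches \n, so the match succeeds.
        Console.WriteLine(Regex.IsMatch(input_text, regex_pattern, RegexOptions.Singleline))
    End Sub
End Module
```

With this input the first call should print False and the second True. If that matches what happens with the real file, the Replace call can probably be dropped altogether once Singleline is set.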
Upvotes: 2
Reputation: 324760
I don't know VB.NET, but does . match newlines, or is there an option you have to set for that? Consider using [\s\S] instead of . to include newlines.
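A rough sketch of the [\s\S] approach, which needs no regex options. The lazy quantifier (*?), the capture group, the sample input, and the module name are assumptions added for illustration; they are not part of the original question:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module CharClassDemo
    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"

        ' [\s\S] means "whitespace or non-whitespace", i.e. any character,
        ' including newlines. The lazy *? stops at the first end comment.
        Dim regex_pattern As String = start_comment & "([\s\S]*?)" & end_comment

        ' Hypothetical multi-line input standing in for the file contents.
        Dim input_text As String = start_comment & vbLf & "<p>some html</p>" & vbLf & end_comment

        Dim match As Match = Regex.Match(input_text, regex_pattern)
        If match.Success Then
            ' Groups(1) holds only the text between the two comments.
            Console.WriteLine("found {0}", match.Groups(1).Value.Trim())
        Else
            Console.WriteLine("not found")
        End If
    End Sub
End Module
```

Capturing the inner text in a group avoids having to strip the comments from match.Value afterwards.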
Upvotes: 0