Illuminati

Reputation: 548

Regex Match for HTML string with newline

I am trying to match:

    <h4>Manufacturer</h4>\n\n  Gigabyte\n\n\n

My regex at the moment is:

    Match regex = Regex.Match(cleanedUpHtml, "Manufacturer(.*?)\n\n\n", RegexOptions.IgnoreCase);

However, it does not match.

The (.*?) should match everything in between.

Upvotes: 2

Views: 1516

Answers (3)

Bastianon Massimo

Reputation: 1742

Generally I prefer to strip the HTML tags and newline characters from the string before applying the regex.

(.*?) stops capturing at a \n character because, by default, the dot does not match a newline. You could use a more generic group instead, such as ([\w\W]*?).
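As a minimal sketch of that point, the helper below runs a pattern against the sample string from the question. The class and method names (`DotNewlineDemo`, `Matches`) are made up for illustration, and `[\s\S]` is used as one common way to write "any character, including a newline".

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // Sample input taken from the question.
    private const string Input = "<h4>Manufacturer</h4>\n\n  Gigabyte\n\n\n";

    // Returns true when the given pattern finds a match in the sample input.
    // No Singleline option is set, so "." will not match "\n" here.
    public static bool Matches(string pattern)
    {
        return Regex.IsMatch(Input, pattern, RegexOptions.IgnoreCase);
    }
}
```

With this helper, `DotNewlineDemo.Matches(@"Manufacturer(.*?)\n\n\n")` returns false, because the dot cannot cross the two newlines that follow `</h4>`, while `DotNewlineDemo.Matches(@"Manufacturer([\s\S]*?)\n\n\n")` returns true.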

Upvotes: 0

Wiktor Stribiżew

Reputation: 626728

Here are 2 things I find important:

  1. Whenever you declare a regex pattern in C#, it is advisable to use verbatim string literals, i.e. @"PATTERN". Backslashes then do not need to be doubled, which simplifies writing regex patterns.

  2. RegexOptions.Singleline must be used to treat multiline text as a single line, i.e. so that a dot also matches a line break.

Here is my code snippet:

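    // Sample input from the question.
    var str = "<h4>Manufacturer</h4>\n\n  Gigabyte\n\n\n";
    // Singleline lets the dot match "\n"; the verbatim string keeps \n as a regex escape.
    var regex = Regex.Match(str, @"Manufacturer(.*?)\n\n\n",
                 RegexOptions.IgnoreCase | RegexOptions.Singleline);
    if (regex.Success)
        MessageBox.Show("\"" + regex.Value + "\"");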

The regex.Value is

"Manufacturer</h4>

  Gigabyte


"

Best regards.
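If only the manufacturer name is wanted rather than the whole match, the captured group can be read and trimmed. The following is a sketch under assumptions: the class and method names (`ManufacturerExtractor`, `Extract`) are hypothetical, and the pattern is anchored on the closing `</h4>` so the tag itself stays out of the capture.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ManufacturerExtractor
{
    // Returns the text between "Manufacturer</h4>" and the first run of
    // three newlines, with surrounding whitespace removed.
    // Returns an empty string when nothing matches.
    public static string Extract(string html)
    {
        var match = Regex.Match(
            html,
            @"Manufacturer</h4>(.*?)\n\n\n",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Groups[1] holds what (.*?) captured; Trim drops the leading
        // newlines and spaces around the value.
        return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
    }
}
```

For the input from the question, `ManufacturerExtractor.Extract("<h4>Manufacturer</h4>\n\n  Gigabyte\n\n\n")` returns `Gigabyte`.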

Upvotes: 2

Illuminati

Reputation: 548

I replaced \n with a placeholder value and then searched for that placeholder with the regex. It works for the time being, but it may not be the best approach. Any recommendations are appreciated.

    cleanedUpHtml = cleanedUpHtml.Replace("\n", "p19o9");
    Match regex = Regex.Match(cleanedUpHtml, "Manufacturer(.*?)p19o9p19o9p19o9", RegexOptions.IgnoreCase);
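One possible alternative that avoids the placeholder is sketched below. It assumes the HTML might arrive with either \n or \r\n line endings, which the question does not state. The names `LineEndingTolerantMatch` and `Extract` are hypothetical, and the inline `(?s)` flag is the in-pattern equivalent of RegexOptions.Singleline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndingTolerantMatch
{
    // Captures the text after "Manufacturer</h4>" up to three consecutive
    // line breaks, accepting both "\n" and "\r\n" (an assumption about the input).
    public static string Extract(string html)
    {
        var match = Regex.Match(
            html,
            @"(?s)Manufacturer</h4>(.*?)(?:\r?\n){3}",
            RegexOptions.IgnoreCase);

        // Trim removes the leading line breaks and spaces from the capture.
        return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
    }
}
```

The original string is left untouched, so there is no risk of the placeholder text colliding with real content in the page.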

Upvotes: 1
