Regex formula assistance

Question

I'm trying to find a regex formula for these HTML nodes:

first: Need the inner html value

Any Name Here

second: need the title value

August 10, 2012 at 5:02 pm

third: need the inner html value


Some message here. might contain any character.

I'm fairly new to regex, and was hoping someone could offer me some guidance with this. I'll be using it with C# if that makes a difference.

Edit:

The HTML I'd be pulling this out of would look like this:


Name
August 10, 2012 at 5:02 pm

Some message

Russ Clarke · Accepted Answer

A lot of people are quite dismissive of using regular expressions to deal with HTML. However, I believe that if your HTML is assuredly regular and well formatted, then you can use Regex successfully.

If you can't guarantee that, then I urge you to check out the HTML Agility Pack; it's a library for parsing HTML in C# and works very well.

I'm not on my PC, but I'll edit my answer with a suggested regex for your examples, give you something to try at least.

For this one:

Any Name Here

Try

"(?.*?)"

Then you can access the captured text via Match.Groups["span"] on your regex result; the (?<span>...) part of the pattern is what gives the group its name.

For the Abbr tag:

...snip...

It's similar

".*?)\".*?>"

And lastly for the div:


Some message here. might contain any character.

Is:

"(?.*?)"

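For this one, you may need to set the Singleline regex option (RegexOptions.Singleline) so that . also matches newlines; the Multiline option only changes how ^ and $ behave.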
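As a rough sketch of the three extractions together: the markup below is made up (the class names, the sample text and the group names "title" and "div" are assumptions, not taken from the real page), and it's written as C# 9 top-level statements to keep it short. It also shows that in .NET it is RegexOptions.Singleline, not RegexOptions.Multiline, that lets . cross a line break.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical markup; class names and contents are assumptions.
string html =
    "<span class=\"name\">Any Name Here</span>\n" +
    "<abbr title=\"August 10, 2012 at 5:02 pm\">2 hours ago</abbr>\n" +
    "<div class=\"message\">Some message here.\nSecond line.</div>";

// Inner text of the span, captured in a group named "span".
string name = Regex.Match(html, "<span[^>]*>(?<span>.*?)</span>")
                   .Groups["span"].Value;

// Value of the title attribute, captured in a group named "title".
string title = Regex.Match(html, "<abbr[^>]*?title=\"(?<title>.*?)\"")
                    .Groups["title"].Value;

// The div body contains a newline, so '.' needs RegexOptions.Singleline.
string divPattern = "<div[^>]*>(?<div>.*?)</div>";
bool plainMatch = Regex.IsMatch(html, divPattern);
bool multilineMatch = Regex.IsMatch(html, divPattern, RegexOptions.Multiline);
string message = Regex.Match(html, divPattern, RegexOptions.Singleline)
                      .Groups["div"].Value;

Console.WriteLine(name);           // Any Name Here
Console.WriteLine(title);          // August 10, 2012 at 5:02 pm
Console.WriteLine(plainMatch);     // False
Console.WriteLine(multilineMatch); // False
Console.WriteLine(message);        // both lines of the message
```

Using [^>]* for the attributes rather than spelling them out is just one way to tolerate extra attributes; if your markup is fixed, matching the exact attribute text is stricter.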

The key point is the .*? operator.

Adding the question mark turns a greedy match into a lazy (non-greedy) match. It tells the regex engine to extend forwards from where the match starts and stop at the first place the rest of the pattern fits, rather than grabbing as much as possible and then working backwards. This is incredibly important for matching in HTML, where you will have many chevrons closing tags.
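A small illustration of the difference, using a made-up one-line sample with two sibling tags (the group name "t" is arbitrary; written as C# 9 top-level statements):

```csharp
using System;
using System.Text.RegularExpressions;

// Two sibling elements on one line (made-up sample).
string html = "<b>one</b> and <b>two</b>";

// Greedy: .* runs to the end of the line, then backtracks to the LAST </b>.
string greedy = Regex.Match(html, "<b>(?<t>.*)</b>").Groups["t"].Value;

// Lazy: .*? grows one character at a time and stops at the FIRST </b>.
string lazy = Regex.Match(html, "<b>(?<t>.*?)</b>").Groups["t"].Value;

Console.WriteLine(greedy); // one</b> and <b>two
Console.WriteLine(lazy);   // one
```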

The big problem you'll get, though, is what happens if the inner text or an attribute has a '<' or a '"' character in it. It's very hard to make a regex match only balanced <>'s, and it can't easily skip the ones that sit between quotes; this is why the Agility Pack is often preferred.
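To make that concrete, here is a sketch of two ways the simple patterns go wrong. Both inputs are invented for the demonstration, and the second uses a '>' inside a quoted attribute as one instance of the same kind of problem (written as C# 9 top-level statements):

```csharp
using System;
using System.Text.RegularExpressions;

// Nested element: the lazy pattern stops at the first </span>,
// not at the one that actually closes the outer tag.
string nested = "<span>outer <span>inner</span> tail</span>";
string innerText = Regex.Match(nested, "<span>(?<s>.*?)</span>")
                        .Groups["s"].Value;

// A '>' inside a quoted attribute value ends [^>]* too early,
// so the rest of the attribute leaks into the captured text.
string tricky = "<abbr title=\"a > b\">when</abbr>";
string afterTag = Regex.Match(tricky, "<abbr[^>]*>(?<a>.*?)</abbr>")
                       .Groups["a"].Value;

Console.WriteLine(innerText); // outer <span>inner
Console.WriteLine(afterTag);  // (space)b">when
```

Neither result is what you'd want, and patching the pattern for one case tends to break another, which is where a real HTML parser earns its keep.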

Hope this helps anyway!

Edit:

How to use named capture groups

This syntax (?<name>..selector..) tells the Regex engine to capture whatever the part between the brackets matches into a named value that can be taken out of the match object.

So for this HTML

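<span>TEST</span>

You'd use this code:

using System.Text.RegularExpressions;

// Assumption: the sample element is a plain <span>; adjust the tag to match your markup.
string HTML = "<span>TEST</span>";
Regex r = new Regex("<span>(?<s>.*?)</span>");
var match = r.Match(HTML);

// The group named "s" holds the text between the tags.
string stuff = match.Groups["s"].Value;
// stuff should be "TEST"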

If you think you'll have multiple matches, then you'd use Regex.Matches instead and loop over the results:

foreach (Match m in r.Matches(HTML))
{
    string stuff = m.Groups["s"].Value;
}

This should yield the answer you need.
