Sugitime

Reputation: 1888

Regex formula assistance

I'm trying to find a regex formula for these HTML nodes:

First: I need the inner HTML value

<span class="profile fn">Any Name Here</span>

Second: I need the title value

<abbr class="time published" title="2012-08-11T07:02:50+0000">August 10, 2012 at 5:02 pm</abbr>

Third: I need the inner HTML value

<div class="msgbody">
Some message here. might contain any character.
</div>

I'm fairly new to regex, and was hoping someone could offer me some guidance with this. I'll be using it with C# if that makes a difference.

Edit:

The HTML I'd be pulling this out of would look like this:

<div class="message">
<div class="from"><span class="profile fn">Name</span></div>
<abbr class="time published" title="2012-08-11T07:02:50+0000">August 10, 2012 at 5:02 pm</abbr>
<div class="msgbody">
Some message
</div>
</div>

Upvotes: 1

Views: 260

Answers (2)

Casimir et Hippolyte

Reputation: 89639

If your HTML is always the same, you can use this ugly pattern:

"profile fn"[^>]*>(?<name>[^<]+)(?:[^t]+|t(?!itle=))+title="(?<time>[^"]+)(?:[^m]+|m(?!sgbody"))+msgbody">\s*(?<msg>(?:[^<\s]+|(?>\s+)(?!<))+)

The results are in m.Groups["name"], m.Groups["time"] and m.Groups["msg"].
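As a rough sketch of how that pattern might be wired up in C#, the snippet below runs it against the sample block from the question's edit. The variable names (html, pattern, name, time, msg) are my own, and the input is assumed to follow the question's layout exactly; anything else may fail to match.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input copied from the question's edit.
string html = string.Join("\n", new[]
{
    "<div class=\"message\">",
    "<div class=\"from\"><span class=\"profile fn\">Name</span></div>",
    "<abbr class=\"time published\" title=\"2012-08-11T07:02:50+0000\">August 10, 2012 at 5:02 pm</abbr>",
    "<div class=\"msgbody\">",
    "Some message",
    "</div>",
    "</div>"
});

// The pattern from the answer, written as a verbatim string (each " is doubled).
var pattern = new Regex(@"""profile fn""[^>]*>(?<name>[^<]+)(?:[^t]+|t(?!itle=))+title=""(?<time>[^""]+)(?:[^m]+|m(?!sgbody""))+msgbody"">\s*(?<msg>(?:[^<\s]+|(?>\s+)(?!<))+)");

Match m = pattern.Match(html);

// Groups are empty strings when the match fails, so check m.Success first in real code.
string name = m.Groups["name"].Value;
string time = m.Groups["time"].Value;
string msg = m.Groups["msg"].Value;

Console.WriteLine($"{name} | {time} | {msg}");
```

If m.Success is false, the markup has probably drifted from the expected shape.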

Upvotes: 0

Russ Clarke

Reputation: 17919

A lot of people are quite dismissive of using Regular Expressions to deal with HTML; however, I believe that if your HTML is assuredly regular and well formatted, then you can use Regex successfully.

If you can't assure that, then I urge you to check out the HTML Agility Pack; it's a library for parsing HTML in C# and works very well.

I'm not on my PC, but I'll edit my answer with a suggested regex for your examples, to give you something to try at least.

For this one:

<span class="profile fn">Any Name Here</span>

Try

"<span.*?>(?<span>.*?)</span>"

Then you can access this via the Match.Groups["span"] indexer of your regex result.

For the Abbr tag:

<abbr class="time published" title="2012-08-11T07:02:50+0000">...snip...</abbr>

It's similar

"<abbr.*?title=\"(?<title>.*?)\".*?>"

And lastly for the div:

<div class="msgbody">
Some message here. might contain any character.
</div>

Is:

"<div.*?>(?<div>.*?)</div>"

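For this one, you will need to set the Singleline regex option (RegexOptions.Singleline) so that . also matches the newlines inside the div. Note that the Multiline option only changes how ^ and $ behave and will not help here.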
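In .NET, the option that lets . match newline characters is RegexOptions.Singleline; RegexOptions.Multiline only affects ^ and $. A small sketch of the difference on the div snippet (variable names are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

string divHtml = "<div class=\"msgbody\">\nSome message here. might contain any character.\n</div>";
string divPattern = "<div.*?>(?<div>.*?)</div>";

// By default '.' does not match '\n', so the multi-line body cannot be captured.
bool matchesByDefault = Regex.IsMatch(divHtml, divPattern);

// Multiline only changes ^ and $, so it does not help either.
bool matchesWithMultiline = Regex.IsMatch(divHtml, divPattern, RegexOptions.Multiline);

// Singleline makes '.' match every character, including '\n'.
Match divMatch = Regex.Match(divHtml, divPattern, RegexOptions.Singleline);
string body = divMatch.Groups["div"].Value.Trim(); // Trim drops the surrounding newlines

Console.WriteLine($"default: {matchesByDefault}, multiline: {matchesWithMultiline}");
Console.WriteLine(body);
```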

The key point is the .*? operator.

Adding the question mark turns a greedy match into a lazy (non-greedy) one. It tells the Regex engine to match as few characters as possible and extend forwards only as far as needed, rather than grabbing as much as it can and then working backwards; this is incredibly important for matching in HTML, where you will have many chevrons closing tags.
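To make the difference concrete, here is a minimal comparison of .* and .*? on two adjacent tags. The input string and variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string twoSpans = "<span>A</span><span>B</span>";

// Greedy: .* runs to the end of the input, then backs up to the LAST </span>.
string greedy = Regex.Match(twoSpans, "<span>(?<s>.*)</span>").Groups["s"].Value;

// Lazy: .*? stops at the FIRST </span> it can reach.
string lazy = Regex.Match(twoSpans, "<span>(?<s>.*?)</span>").Groups["s"].Value;

Console.WriteLine(greedy); // A</span><span>B
Console.WriteLine(lazy);   // A
```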

The big problem you'll get, though, is what happens if the inner text or an attribute has a '<' or a '"' character in it. It's very hard to make Regex match only balanced <>'s, and it can't easily skip ones that sit between quotes; this is why the Agility Pack is often preferred.
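A quick way to see this kind of failure is to put a chevron inside a quoted attribute. The input below is a contrived example of my own (it uses '>' rather than '<', but the cause is the same): the pattern cannot tell that the chevron is inside quotes, so it treats it as the end of the tag.

```csharp
using System;
using System.Text.RegularExpressions;

// The '>' inside the title attribute is not the end of the tag, but the regex cannot know that.
string tricky = "<span title=\"a > b\">Name</span>";

string captured = Regex.Match(tricky, "<span.*?>(?<span>.*?)</span>").Groups["span"].Value;

// The capture starts right after the first '>', so it includes part of the attribute.
Console.WriteLine(captured);
```

The match still succeeds, which is what makes this dangerous: you get a wrong value rather than an error.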

Hope this helps anyway!

Edit:

How to use named capture groups

The syntax (?<name>..selector..) tells the Regex engine to capture whatever is between the brackets into a named value that can be read from the match object.

So for this HTML

<span>TEST</span>

You'd use this code:

using System.Text.RegularExpressions;

string HTML = "<span>TEST</span>";
Regex r = new Regex("<span>(?<s>.*?)</span>");
var match = r.Match(HTML);

// Read the named group "s" from the match.
string stuff = match.Groups["s"].Value;
// stuff is now "TEST"

If you think you'll have multiple matches, then you'd call Matches instead of Match and loop over the results:

foreach (Match m in r.Matches(HTML))
{
   string stuff = m.Groups["s"].Value;
}
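For completeness, here is a self-contained version of that loop that collects every captured value into a list. The two-span input and the names spans, spanRegex and values are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string spans = "<span>ONE</span><span>TWO</span>";
Regex spanRegex = new Regex("<span>(?<s>.*?)</span>");

// Matches returns every non-overlapping match, in order of appearance.
var values = new List<string>();
foreach (Match sm in spanRegex.Matches(spans))
{
    values.Add(sm.Groups["s"].Value);
}

Console.WriteLine(string.Join(", ", values)); // ONE, TWO
```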

This should yield the answer you need.

Upvotes: 1
