Reputation: 331052
I have these strings in a string stream:
"do=whoposted&t=1934067" rel=nofollow>61</A></TD><TD class=alt2 align=middle>5,286</TD></TR><TR><TD id=td_threadstatusicon_1911046 class=alt1><IMG id=thread_statusicon_1911046 border=0 alt="" src="http://url.com/forum/images/statusicon/thread_new.gif"> </TD><TD class=alt2><IMG title=Film border=0 alt=Film src="http://url.com/forum/images/icons/new.png"></TD><TD id=td_threadtitle_1911046 class=alt1 title="http://lulzimg.com/i14/7bd11b.jpg Complete name : cool-thread.."><DIV><A id=thread_gotonew_1911046 href="http://url.com/forum/f80/cool-topic-new/"><IMG class=inlineimg title="Go to first new post" border=0 alt="Go to first new post" src="http://url.com/forum/images/buttons/firstnew.gif"></A> [MULTI] <A style="FONT-WEIGHT: bold" id=thread_title_1911046 href="http://url.com/forum/f80/cool-topic-name-1911046/">Cool Topic Name</A> </DIV><DIV class=smallfont><SPAN style="CURSOR: pointer" onclick="window.open('http://url.com/forum/members/u2031889/', '_self')">m3no</SPAN> </DIV></TD><TD class=alt2 title="Replies: 11, Views: 1,554"><DIV style="TEXT-ALIGN: right; WHITE-SPACE: nowrap" class=smallfont>Today <SPAN class=time>08:04 AM</SPAN><BR>by <A href="http://url.com/forum/members/u1131830/" rel=nofollow>karetsos</A> <A "
Currently I use this:
Regex pattern = new Regex ( "<A\\s+href=\"([^\"]*)\">([^\\x00]*?)\\s+id=thread_title_(\\S+)</A>" );
MatchCollection matches = pattern.Matches ( doc.ToString ( ) );
foreach ( Match match in matches )
{
    int id = Convert.ToInt32 ( match.Groups [ 1 ].Value );
    string name = match.Groups [ 3 ].Value;
    string link = match.Groups [ 2 ].Value;
    ...
}
But it doesn't match anything.
All I am trying to extract are:
IDs: 942321, 512147.
Names: "Visible Thread Name", "Cool Thread"
Links: "http://url.com/forum/f80/new-topic-name-942321", "http://url.com/forum/f80/cool-topic-name-512147"
Any ideas on how to fix it?
Upvotes: 0
Views: 138
Reputation: 172270
List of issues that I found:
By default, regexes are case-sensitive (a != A). One possible solution is to pass RegexOptions.IgnoreCase as the second parameter to your Regex constructor.
id=thread...: you seem to be missing the opening " after id=.
After matching the id you suddenly stop. Don't you want to match the name as well, in a third group? I guess your regex should end like this: id=\"thread_title_([0-9]+)\">([^<]+)</a>
Oh, and don't close the a tag after the href, because the thread_title id is still inside the tag: in href=\"([^\"]*)\"> remove the > at the end.
In addition, remove that strange [^\\x00]*? group. What's that good for anyway?
After capturing the thread_title id, you need to skip everything up to the closing > in order to ignore the style=... attribute.
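To illustrate the first point, here is a minimal sketch of how RegexOptions.IgnoreCase changes the outcome for an upper-case tag. The class and method names (CaseDemo, MatchesAnchor) and the sample tag are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaseDemo
{
    // Hypothetical helper: does the input contain something like "<a href"?
    public static bool MatchesAnchor(string html, RegexOptions options)
    {
        return Regex.IsMatch(html, @"<a\s+href", options);
    }

    static void Main()
    {
        string tag = "<A href=\"x\">"; // upper-case tag name, as in the question's markup

        // Case-sensitive by default: "<a" does not match "<A".
        Console.WriteLine(MatchesAnchor(tag, RegexOptions.None));       // False

        // With IgnoreCase the same pattern matches.
        Console.WriteLine(MatchesAnchor(tag, RegexOptions.IgnoreCase)); // True
    }
}
```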
Full solution (warning, spoiler ahead). The @"..." syntax (a verbatim string) means you don't need to escape backslashes, but you do need to escape each quote by doubling it ("").
Regex pattern = new Regex (@"<a\s+href=""([^""]*)""\s+id=""thread_title_([0-9]+)""[^>]*>([^<]+)</a>", RegexOptions.IgnoreCase);
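A rough sketch of how this pattern could be plugged into the loop from the question. Note that the group order differs from the question's code: group 1 is the link, group 2 the id, and group 3 the name, so calling Convert.ToInt32 on Groups[1] would fail. The input string below is an assumption: it uses quoted attributes with href directly before id, which is what the pattern expects. The snippet pasted in the question has an unquoted id placed before href, so the pattern would need adjusting for markup in that shape. ThreadLinkDemo and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ThreadLinkDemo
{
    // Pattern from the answer, with IgnoreCase so <A ...> and <a ...> both match.
    static readonly Regex Pattern = new Regex(
        @"<a\s+href=""([^""]*)""\s+id=""thread_title_([0-9]+)""[^>]*>([^<]+)</a>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns (id, name, link) for every thread link found.
    public static List<(int Id, string Name, string Link)> Extract(string html)
    {
        var results = new List<(int Id, string Name, string Link)>();
        foreach (Match match in Pattern.Matches(html))
        {
            string link = match.Groups[1].Value;              // href value
            int id = Convert.ToInt32(match.Groups[2].Value);  // digits after thread_title_
            string name = match.Groups[3].Value;              // text between <a ...> and </a>
            results.Add((id, name, link));
        }
        return results;
    }

    static void Main()
    {
        // Assumed input shape: quoted attributes, href first, then id, then anything else.
        string html =
            "<A href=\"http://url.com/forum/f80/cool-topic-name-512147\" " +
            "id=\"thread_title_512147\" style=\"FONT-WEIGHT: bold\">Cool Thread</A>";

        foreach (var (id, name, link) in Extract(html))
        {
            Console.WriteLine($"{id} | {name} | {link}");
        }
    }
}
```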
BTW, for debugging this I used the following tool, which I can recommend and which automatically provides an escaped version:
Upvotes: 1
Reputation: 2340
This returns what you need. No need to be overly strict here:
<a.+href=".*topic\-name\-(\S+)\/.+thread_title_(\S+)"
Upvotes: 1