Reputation:
Data:
<tr>
<td>
<a href="somelink">
some. .data...
</a>
</td>
<td>Black</td>
<td>57234</td>
<td>5431.60</td>
<td><font class="down"> -125.02</font></td>
</tr><tr>
<td>
<a href="somelink">
some. .data...
</a>
</td>
<td>Blue</td>
<td>57234</td>
<td>5431.60</td>
<td><font class="up"> -125.02</font></td>
</tr><tr>
<td>
<a href="somelink">
some. .data...
</a>
</td>
<td>Brown</td>
<td>57234</td>
<td>5431.60</td>
<td><font class="down"> -125.02</font></td>
</tr>
...more data...
I want to extract 'some. .data...', 'Black', '57234', and '5431.60' from each row in a single pass. (The data in the fifth td is not required.)
Initially, this pattern worked (I arrived at it by trial and error):
<tr><td><a.*>([a-zA-Z0-9 -]+)</a></td><td>(\w+)</td><td>([\d]+\.\d+)</td><td>(\d+\.\d+)</td>
But now it no longer matches.
Now, when I use <td>(.*)</td> or <\w+>(.*)</\w+>, it returns the data from the last four tds in every tr. Why doesn't it also return <a href...>...</a>, and how can I get the data I want?
Upvotes: 0
Views: 247
Reputation: 499042
Regex is, in general, a bad way to parse HTML.
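As one concrete illustration of the fragility, here is a minimal sketch of why a pattern like <td>(.*)</td> skips the first cell in the question's markup: in .NET, `.` does not match a newline unless RegexOptions.Singleline is set, and the first td spans several lines. The input string is a shortened, hypothetical copy of the sample row, and the class name is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

class DotNewlineSketch
{
    static void Main()
    {
        // Hypothetical input: the first cell spans several lines, the second does not.
        string html = "<td>\n<a href=\"somelink\">\nsome. .data...\n</a>\n</td>\n<td>Black</td>";

        // By default '.' stops at '\n', so the multi-line first cell cannot match.
        MatchCollection plain = Regex.Matches(html, "<td>(.*)</td>");
        Console.WriteLine(plain.Count);  // 1 (only the "Black" cell)

        // Singleline lets '.' cross newlines; the lazy '.*?' stops at the nearest </td>.
        MatchCollection multi = Regex.Matches(html, "<td>(.*?)</td>", RegexOptions.Singleline);
        Console.WriteLine(multi.Count);  // 2 (both cells)
    }
}
```

Even with that option the capture still contains the raw <a> markup and surrounding whitespace, which is part of why a parser tends to be the more robust choice.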
I suggest taking a look at the HTML Agility Pack or CsQuery, which are purpose-built HTML parsers for .NET.
The HTML Agility Pack can be queried using XPath and LINQ, and CsQuery uses jQuery selectors.
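A minimal sketch of the XPath route with the HTML Agility Pack, assuming the HtmlAgilityPack NuGet package is referenced. The input string is a hypothetical single row shaped like the sample in the question (wrapped in a table element for the example), and the class name is made up.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

class XPathSketch
{
    static void Main()
    {
        // Hypothetical input: one row shaped like the sample in the question.
        string html =
            "<table><tr><td><a href=\"somelink\">some. .data...</a></td>" +
            "<td>Black</td><td>57234</td><td>5431.60</td>" +
            "<td><font class=\"down\"> -125.02</font></td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return;

        foreach (var tr in rows)
        {
            // position() <= 4 keeps the first four cells and skips the fifth.
            var cells = tr.SelectNodes("td[position() <= 4]");
            if (cells == null) continue;

            // InnerText drops the nested <a> markup and keeps only its text.
            Console.WriteLine(string.Join(" | ", cells.Select(td => td.InnerText.Trim())));
        }
    }
}
```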
Upvotes: 6
Reputation: 35353
If you used a real HTML parser, your code would be simpler and easier to maintain:
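// Requires the HtmlAgilityPack NuGet package and `using System.Linq;`.
// `html` is a string holding the page markup.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var table = doc.DocumentNode.Descendants("tr")
    .Select(tr => tr.Descendants("td").Select(td => td.InnerText).ToList())
    .ToList();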
Given the sample HTML you provided, the code above returns 3 rows, each containing 5 columns.
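Since the question only needs the first four cells, the same LINQ query could be narrowed with Take(4). The following is a sketch under assumptions: the input is a hypothetical single row copied from the question's sample, the class name is made up, and the Where filter is an added guess at how rows with fewer than four cells should be handled.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

class FirstFourColumns
{
    static void Main()
    {
        // Hypothetical input: one multi-line row, laid out as in the question.
        string html = @"<tr>
<td>
<a href=""somelink"">
some. .data...
</a>
</td>
<td>Black</td>
<td>57234</td>
<td>5431.60</td>
<td><font class=""down""> -125.02</font></td>
</tr>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = doc.DocumentNode.Descendants("tr")
            .Select(tr => tr.Descendants("td")
                .Take(4)                                   // ignore the fifth cell
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToList())
            .Where(cells => cells.Count == 4)              // assumption: skip short rows
            .ToList();

        foreach (var cells in rows)
            Console.WriteLine(string.Join(" | ", cells));
    }
}
```

Trim() matters here because the first cell's text is surrounded by the line breaks around the <a> element.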
Upvotes: 1