Reputation: 58883
I have the following piece of text from which I'd like to extract all the <td ????>???</td> tags:
<tr id=row509>
<td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
<td align=center class='style4'>23</td>
<td align=center class='style10'>22</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td id=rowtot509 align=center class='style6'>0</td>
<td align=center class='style6'>0</td>
<td align=center class='style2'>0</td>
<td align=center class='style6'>0</td>
</tr>
The expected result would be:
1. <td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
2. <td align=center class='style4'>23</td>
3. <td align=center class='style10'>22</td>
[..]
Any help? Thanks
Upvotes: 0
Views: 296
Reputation: 96720
Regular expressions are a pretty fragile tool for this kind of problem, especially if there's any risk at all that a table cell's content could be another table. (In that case, the first </td> tag you find after a <td> start tag may not be closing that element at all, but one of its descendants.)
A much more robust way to tackle problems like these is to parse the HTML into a DOM and then examine the DOM. The HTML Agility Pack is one parser that people seem to like.
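To illustrate the parse-then-inspect idea, here is a minimal sketch in Python using only the standard library's html.parser. It is not the HTML Agility Pack (that is a .NET library and is not shown here); the class name CellCollector and the shortened SAMPLE string are made up for this example. It tracks <td> nesting depth so a table inside a cell would not cut the outer cell short.

```python
from html.parser import HTMLParser

# Shortened copy of the row from the question (hypothetical test input).
SAMPLE = """<tr id=row509>
<td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
<td align=center class='style4'>23</td>
<td align=center class='style10'>22</td>
</tr>"""


class CellCollector(HTMLParser):
    """Collects (attributes, text) for every outermost <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []    # one (attrs dict, text) tuple per outermost cell
        self._depth = 0    # how many <td> elements are currently open
        self._attrs = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1
            if self._depth == 1:
                # Entering an outermost cell: start a fresh record.
                self._attrs = dict(attrs)
                self._text = []

    def handle_endtag(self, tag):
        if tag == "td" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Leaving the outermost cell: store what was gathered.
                self.cells.append((self._attrs, "".join(self._text).strip()))

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)


collector = CellCollector()
collector.feed(SAMPLE)
cells = collector.cells

for attrs, text in cells:
    print(attrs, text)
```

The parser copes with the unquoted attribute values (id=serv509) on its own, so no clean-up pass is needed before feeding it the markup.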
Upvotes: 0
Reputation: 297195
What's the problem with using an HTML or XML library?
Using XML and XPath, for instance, this would just be a case of doing xml / td, in whatever way the library API supports that.
Regex is a lousy way of doing that, because XML is not a regular language. Specifically, you can nest tags inside other tags to any depth, and that is something regular expressions can't represent.
So, while it would be easy to create a regular expression for the simple case (<td.*?</td>), it would easily break if the XML changed just a bit.
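A quick sketch of both halves of that claim, in Python for illustration (the thread is about C#, but the pattern behaves the same way). The two input strings are invented for this example, and the DOTALL flag is an addition so that cells spanning several lines would also match.

```python
import re

# The simple-case pattern from the answer; DOTALL is an added assumption.
TD_PATTERN = re.compile(r"<td.*?</td>", re.DOTALL)

# Flat row: every cell is matched as expected.
flat = "<tr><td align=center class='style4'>23</td><td align=center class='style10'>22</td></tr>"
flat_matches = TD_PATTERN.findall(flat)

# Nested table: the lazy match stops at the FIRST </td>, which closes the
# inner cell, so the outer cell comes back truncated.
nested = "<tr><td><table><tr><td>inner</td></tr></table></td></tr>"
nested_matches = TD_PATTERN.findall(nested)

print(flat_matches)
print(nested_matches)
```

On the flat row the two cells come back intact; on the nested one the single match is "<td><table><tr><td>inner</td>", which is neither the outer cell nor the inner one.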
Granted, the XML is broken as posted (the attribute values are unquoted), but you can fix that with a regex. :-) For instance, if you replace the pattern (\w+)=(\w+) with $1='$2' (or \1='\2', if that's the syntax of C# replace patterns), you'll get valid XML.
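Here is a rough sketch of that quote-then-parse approach, again in Python for illustration rather than C#. It uses the \1='\2' form (Python's replacement syntax) and the standard library's xml.etree.ElementTree; SAMPLE is a shortened copy of the row from the question. One caveat worth stating: the pattern is applied to the whole string, so it would also rewrite any name=value text that happened to appear inside a cell.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the row from the question (hypothetical test input).
SAMPLE = """<tr id=row509>
<td id=serv509 align=center class='style1'>Z Deviazione Tecnico Home verso S24 [ NON USATO ]</td>
<td align=center class='style4'>23</td>
<td id=rowtot509 align=center class='style6'>0</td>
</tr>"""

# Quote every bare attribute value. class='style1' is left alone because
# the quote character after "=" is not a word character.
quoted = re.sub(r"(\w+)=(\w+)", r"\1='\2'", SAMPLE)

# The result is well-formed XML, so a path query for the <td> children works.
row = ET.fromstring(quoted)
tds = row.findall("td")
texts = [td.text for td in tds]

for td in tds:
    # Serialize each cell back to markup (ElementTree emits double quotes).
    print(ET.tostring(td, encoding="unicode").strip())
```

findall("td") plays the role of the xml / td lookup mentioned above: it returns only the direct <td> children of the row.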
Upvotes: 2
Reputation: 2586
I would agree with Daniel, but if you really must use a regex - get yourself a copy of RegexBuddy so you can quickly debug your expression. Best $40 I've spent in a long time.
Upvotes: 0