Desolator

Reputation: 22749

Help with a regular expression pattern to extract some text from HTML in C#

I have this HTML block:

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">nanana<span>bababa</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>


<tr>
<th colspan="2" valign="middle">Some other text</th>
</tr>
<tr>
<td class="row1">(this text needs to be extracted)</td>
<td class="row2"><input name="myUniqueInput"></td>
</tr>

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>

What I need is to extract only the text shown as "(this text needs to be extracted)", i.e. the contents of the cell just before the "myUniqueInput" input. Here is what I've done so far:

<th[^>]*>(.*?)<input[^>]*name="myUniqueInput"[^>]*>

The problem with this pattern is that it matches everything from the first <th> up to "myUniqueInput". Any idea how to fix this? Thanks in advance.

Upvotes: 1

Views: 247

Answers (2)

Johan Soderberg

Reputation: 2740

/<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"/

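You can match more or less, depending on how well you know what the HTML will look like. The idea is to anchor on the <td> that holds the input with the wanted name, then capture everything between the <td> and </td> of the cell just before it. The surrounding slashes are only delimiters; leave them out when passing the pattern to Regex in C#.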
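
As a rough sketch of how the pattern above could be used from C#, something like the following should work. The class and method names (CellExtractor, TryExtractCellText) are made up for illustration, and the sample HTML is a shortened copy of the block from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class CellExtractor
{
    // Same pattern as in this answer, without the surrounding slashes.
    // Group 1 captures the text of the cell just before the input's cell.
    private static readonly Regex CellBeforeInput = new Regex(
        @"<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name=""myUniqueInput""");

    // Hypothetical helper: returns true and the trimmed cell text on a match,
    // otherwise false and an empty string.
    public static bool TryExtractCellText(string html, out string text)
    {
        Match m = CellBeforeInput.Match(html);
        text = m.Success ? m.Groups[1].Value.Trim() : string.Empty;
        return m.Success;
    }

    static void Main()
    {
        string html = @"<tr>
<td class=""row1"">lalala<span>dadada</span></td>
<td class=""row2""><input name=""unwantedinput""></td>
</tr>
<tr>
<th colspan=""2"" valign=""middle"">Some other text</th>
</tr>
<tr>
<td class=""row1"">(this text needs to be extracted)</td>
<td class=""row2""><input name=""myUniqueInput""></td>
</tr>";

        string text;
        if (TryExtractCellText(html, out text))
        {
            Console.WriteLine(text); // prints "(this text needs to be extracted)"
        }
    }
}
```

Because the capture group is [^<]*, this only works while the wanted cell contains plain text with no nested tags; a cell like the "lalala<span>dadada</span>" rows would not match.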

Upvotes: 1

Brian Willis

Reputation: 23854

It's generally accepted that regular expressions aren't expressive enough to parse HTML correctly. Have you considered using a library to parse the HTML for you, and then extracting the data from there?
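
For illustration only, here is a minimal sketch of the parser approach. It assumes the third-party HtmlAgilityPack NuGet package, which is my choice for the example and not something named above; the CellLookup class and FindCellText method are hypothetical names as well.

```csharp
using System;
using HtmlAgilityPack; // assumption: third-party package, installed via NuGet

static class CellLookup
{
    // Hypothetical helper: finds the <td> that contains the input named
    // "myUniqueInput" and returns the text of the <td> immediately before it,
    // or an empty string when no such cell exists.
    public static string FindCellText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//td[input[@name='myUniqueInput']]/preceding-sibling::td[1]");

        return cell == null
            ? string.Empty
            : HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }
}
```

Unlike the regex, this does not depend on attribute order or whitespace between the tags, and InnerText still returns the text if the cell contains nested elements such as a <span>.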

Upvotes: 0
