Reputation: 22749
I have this HTML block:
<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>
<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">nanana<span>bababa</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>
<tr>
<th colspan="2" valign="middle">Some other text</th>
</tr>
<tr>
<td class="row1">(this text needs to be extracted)</td>
<td class="row2"><input name="myUniqueInput"></td>
</tr>
<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>
What I need is to extract only the text shown as "(this text needs to be extracted)", i.e. the contents of the cell next to the "myUniqueInput" input. Here is what I've done so far:
<th[^>]*>(.*?)<input[^>]*name="myUniqueInput"[^>]*>
The problem with this pattern is that it matches everything from the beginning of the block up to "myUniqueInput". Any idea how to fix this? Thanks in advance.
Upvotes: 1
Views: 247
Reputation: 2740
/<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"/
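You can match more or less of the surrounding markup, depending on how well you know what the HTML will look like. The idea is to anchor on the <td> that wraps the input with the wanted name, then capture everything between the previous <td> and its </td>. Note that [^<]* stops at the first tag, so this only works when the wanted cell contains plain text.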
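As a quick check, the pattern can be run against a trimmed copy of the markup from the question. This is only a sketch in Python (the question does not say which language is in use, so that is an assumption), and HTML, PATTERN and extract_label are names made up for the example.

```python
import re

# Trimmed copy of the question's markup: one unwanted row, one wanted row.
HTML = """
<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>
<tr>
<th colspan="2" valign="middle">Some other text</th>
</tr>
<tr>
<td class="row1">(this text needs to be extracted)</td>
<td class="row2"><input name="myUniqueInput"></td>
</tr>
"""

# Same pattern as above, without the /.../ delimiters.
PATTERN = re.compile(
    r'<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"'
)


def extract_label(html):
    """Return the text of the cell just before the myUniqueInput cell, or None."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


print(extract_label(HTML))  # (this text needs to be extracted)
```

The earlier rows are rejected because [^<]* cannot cross a tag: after "lalala" the pattern consumes <span>, then expects the next <td> and finds "dadada" instead, so the match can only succeed on the row that really precedes the wanted input.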
Upvotes: 1
Reputation: 23854
It's generally accepted that regular expressions aren't expressive enough to parse HTML correctly. Have you considered using a library to parse the HTML for you, and then extracting the data from there?
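To make the suggestion concrete, here is a rough sketch using Python's standard-library html.parser (the language is an assumption, since the question does not name one). CellBeforeInput, text_before_input and SAMPLE are hypothetical names for this example. It remembers the text of the most recent "row1" cell and keeps it once the input with the wanted name shows up.

```python
from html.parser import HTMLParser


class CellBeforeInput(HTMLParser):
    """Collect the text of the latest td.row1; keep it when the target input appears."""

    def __init__(self, input_name):
        super().__init__()
        self.input_name = input_name
        self.in_row1 = False   # True while inside a <td class="row1">
        self.current = []      # text pieces of the most recent row1 cell
        self.result = None     # set once the target input is seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") == "row1":
            self.in_row1 = True
            self.current = []
        elif (tag == "input" and attrs.get("name") == self.input_name
                and self.result is None):
            self.result = "".join(self.current).strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_row1 = False

    def handle_data(self, data):
        if self.in_row1:
            self.current.append(data)


def text_before_input(html, input_name):
    """Return the row1 text that precedes the named input, or None if not found."""
    parser = CellBeforeInput(input_name)
    parser.feed(html)
    parser.close()
    return parser.result


SAMPLE = """
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>
<tr>
<td class="row1">(this text needs to be extracted)</td>
<td class="row2"><input name="myUniqueInput"></td>
</tr>
"""

print(text_before_input(SAMPLE, "myUniqueInput"))  # (this text needs to be extracted)
```

Unlike a [^<]* capture, this also copes with nested tags inside the cell: asking for "unwantedinput" on the same sample returns "lalaladadada".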
Upvotes: 0