Reputation: 6159
I would like to scrape the name and address information that lies between the tag containing the "Defendant" text and the next heading tag.
My HTML structure is:
<hr>
<H5>Defendant/Respondent Information</H5>
<span class="InfoChargeStatement">(Each Defendant/Respondent is displayed below)</span>
<table>
<tr>
<td><span class="FirstColumnPrompt">Party Type:</span></td><td><span class="Value">Defendant</span><span class="Prompt">Party No.:</span><span class="Value">1</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Name 1</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Address:</span></td><td><span class="Value">Addr 1</span></td>
</tr>
<tr>
<td><span class="FirstColumnPrompt">City:</span></td><td><span class="Value">city1</span><span class="Prompt">State:</span><span class="Value">aa</span><span class="Prompt">Zip Code:</span><span class="Value">Zip1</span></td>
</tr>
</table>
<hr>
<table>
<tr>
<td><span class="FirstColumnPrompt">Party Type:</span></td><td><span class="Value">Defendant</span><span class="Prompt">Party No.:</span><span class="Value">2</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Name 2</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Address:</span></td><td><span class="Value">Addr2</span></td>
</tr>
<tr>
<td><span class="FirstColumnPrompt">City:</span></td><td><span class="Value">City2</span><span class="Prompt">State:</span><span class="Value">st2</span><span class="Prompt">Zip Code:</span><span class="Value">zip2</span></td>
</tr>
</table>
<hr>
<H5>Related Persons Information</H5>
<span class="InfoChargeStatement">(Each Related person is displayed below)</span>
<table>
<tr>
<td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Unwanted Name</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Address:</span></td><td><span class="Value">un addr</span></td>
</tr>
<tr>
<td><span class="FirstColumnPrompt">City:</span></td><td><span class="Value">Unwanted City</span><span class="Prompt">State:</span><span class="Value">Unwanted city</span><span class="Prompt">Zip Code:</span><span class="Value">12345</span></td>
</tr>
</table>
<table></table>
<hr>
My current XPath captures the first occurrence of the name and address correctly, but when I need to extract multiple occurrences, it also scrapes the information under the unwanted h5 tags.
My current XPath is:
"//*[contains(text(),'Defendant')]//following-sibling::table//span[text()='Name:' or text()='Business or Organization Name:']/ancestor-or-self::td/following-sibling::td//text()"
I tried adding preceding-sibling and following-sibling conditions, but nothing gives my expected output.
My current output is:
names - [
Name1,
Name2,
Unwanted Name
]
Expected output is:
[
Name1,
Name2
]
Kindly help.
Upvotes: 1
Views: 545
Reputation: 4834
Try this:
"//H5[contains(text(),'Defendant')]/following-sibling::table[not(preceding-sibling::H5[not(contains(text(),'Defendant'))])]/tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
It first selects the tables that do not have a preceding-sibling::h5 whose text() does not contain 'Defendant', and then selects from each of those tables the tr whose first td meets your requirements and returns the second td.
There is no need for double slashes, which are bad for performance.
EDIT 1
Since there are more preceding-sibling::h5 elements than the example shows, this XPath will deal with that:
"//H5[contains(text(),'Defendant')]/following-sibling::table[preceding-sibling::H5[1][contains(text(),'Defendant')]]//tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
This will only select those tables that have, as their first preceding-sibling::h5, the same h5 we are interested in.
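To make the preceding-sibling::H5[1] predicate more concrete, here is a rough plain-Python equivalent that uses only the standard library. It walks the siblings in document order, remembers the most recent h5, and keeps a table only while that h5 mentions the keyword. The trimmed, XML-clean sample (with a wrapping root element and self-closed hr tags) and the helper name names_under_heading are assumptions made for this sketch, not part of the question.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page in the question (assumption).
SAMPLE = """<root>
<h5>Defendant/Respondent Information</h5>
<table><tr><td><span>Name:</span></td><td><span>Name 1</span></td></tr></table>
<hr/>
<table><tr><td><span>Name:</span></td><td><span>Name 2</span></td></tr></table>
<hr/>
<h5>Related Persons Information</h5>
<table><tr><td><span>Name:</span></td><td><span>Unwanted Name</span></td></tr></table>
</root>"""


def names_under_heading(root, keyword):
    """Collect 'Name:' values from tables whose nearest preceding h5 contains keyword."""
    names = []
    in_section = False
    for child in root:  # direct children, in document order
        if child.tag == "h5":
            # Each new h5 switches the section on or off.
            in_section = keyword in (child.text or "")
        elif child.tag == "table" and in_section:
            for tr in child.iter("tr"):
                cells = tr.findall("td")
                if len(cells) >= 2 and cells[0].findtext("span") == "Name:":
                    names.extend(span.text for span in cells[1].findall("span"))
    return names


root = ET.fromstring(SAMPLE)
print(names_under_heading(root, "Defendant"))
```

The point of the sketch is only that "first preceding h5" behaves like a section marker: every table belongs to the last heading seen before it.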
EDIT 2
Actually, the first h5 selection is now redundant. This XPath will do:
"//table[preceding-sibling::H5[1][contains(text(),'Defendant')]]//tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
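As a quick check, the final expression can be run against a trimmed copy of the sample page. This is a minimal sketch that assumes lxml is the parser in use (the question does not say which library it uses). lxml's HTML parser normalizes tag names to lowercase, so the sketch writes h5 instead of H5; with a case-preserving XML parser the original spelling would be needed. The wrapping div and the shortened markup are also assumptions.

```python
from lxml import html

# Trimmed copy of the page from the question, wrapped in a div (assumption).
SAMPLE = """
<div>
<hr>
<h5>Defendant/Respondent Information</h5>
<span class="InfoChargeStatement">(Each Defendant/Respondent is displayed below)</span>
<table><tr><td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Name 1</span></td></tr></table>
<hr>
<table><tr><td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Name 2</span></td></tr></table>
<hr>
<h5>Related Persons Information</h5>
<table><tr><td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Unwanted Name</span></td></tr></table>
</div>
"""

# Same expression as EDIT 2, with the tag name lowercased for the HTML parser.
XPATH = (
    "//table[preceding-sibling::h5[1][contains(text(),'Defendant')]]"
    "//tr[td[1][span[text()[.='Name:']]]]/td[2]/span/text()"
)

tree = html.fromstring(SAMPLE.strip())
names = [str(value) for value in tree.xpath(XPATH)]
print(names)
```

Only the tables whose nearest preceding h5 mentions 'Defendant' should contribute, so the name under "Related Persons Information" is skipped.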
Upvotes: 1